INTRODUCTION

In  a  January  1995  report  entitled  "Contingent  Valuation  of  Natural 
Resource Damages Due to Injuries to the Upper Clark Fork River Basin," a team
from  RCG/Hagler  Bailly  reported  the  results  of  a  study  they  had  conducted  to 
value,  in  monetary  terms,  complete  and  partial  cleanup  of  hazardous  waste 
sites  in  the  Clark  Fork  Basin.   This  report  provides  a  peer  review  of  their 
study.   For  convenience,  I  will  refer  to  it  as  the  Clark  Fork  study. 

My  peer  review  will  focus  on  the  validity  of  the  Clark  Fork  study.   The 
term  "validity"  as  used  here  is  synonymous  with  what  others,  including  the 
NOAA  Panel,  have  termed  "reliability."   Either  term  refers  to  the  accuracy  of 
the  results.   The  issue  is  whether  the  results  from  the  Clark  Fork  study  are 
sufficiently  valid  to  be  used  in  estimating  the  damages  to  Clark  Fork  Basin 
resources  from  releases  of  hazardous  substances  at  the  sites  in  question. 

Assessing  the  validity  of  any  economic  value  measure,  including  measures 
from  contingent  valuation  studies,  is  problematical  because  true  values  are 
unobservable. True values exist only in economic theory. Theorists think in
terms  of  the  economic  well-being  or  "utility"  enjoyed  by  individuals. 
Individuals  enjoy  utility  to  the  extent  that  their  preferences  are  satisfied. 
However,  since  preferences  are  not  directly  observable,  true  values  cannot  be 
observed.   Instead,  economists  rely  on  evidence  such  as  market  prices  and 
contingent  valuation  results  to  attempt  to  infer  something  about  economic 
values.   Judging  the  accuracy  of  such  inferences  is  a  complicated  business, 
however,  since  true  values  cannot  be  observed  and  used  as  a  standard  for 
comparison. 

This  is  not  an  unusual  problem  for  the  social  sciences.   For  instance, 
such  concepts  as  intelligence  and  proficiency  in  mathematics  are  as 
unobservable  as  willingness  to  pay.   IQ  tests  and  math  exams  provide  evidence 
about  intelligence  and  math  proficiency,  respectively.   In  a  similar  vein, 
market  demand  analyses,  contingent  valuation  studies,  and  other  methods  are 
used  to  estimate  economic  values.   In  all  these  cases,  the  question  becomes, 
How  good  (how  valid?  how  reliable?)  are  the  observed  values  as  indicators  of 
the  unobservable  true  values? 

Three mutually reinforcing approaches have evolved in the social
sciences  to  address  this  question.   Content  validity  assessment  involves 
examination  of  measurement  procedures  using  standards  based  on  theory,  past 
experience,  and  common  sense.   Construct  validity  assessment  involves  testing 
theory-based  hypotheses  about  the  relationships  between  estimates  of  true 
values  and  other  variables.   Criterion  validity  involves  comparing  estimates 
arrived  at  using  the  measurement  technique  being  evaluated  to  other  indicators 
of  the  same  true  value  that  are  arguably  closer  to  the  true  value.  Although 
true  values  cannot  be  observed  directly,  if  opportunities  are  available  to 
estimate  true  values  in  ways  that  have  high  scientific  credibility,  then  such 
estimates  can  serve  as  standards  for  comparison,  or  criteria,  for  evaluating 
the  validity  of  the  technique  in  question. 

Each  of  these  approaches  to  validity  assessment  can  be  applied  to 
contingent  valuation.   Content  validity  assessment  involves  examining  the 
procedures  that  were  followed  in  designing  and  executing  the  contingent 
valuation  survey  and  analyzing  the  results.   Poor  procedures  may  lead  to 
inaccuracy.   Sound  procedures  should  increase  the  likelihood  that  observed 
values  are  valid. 

Construct  validity  assessment  asks  whether  or  not  the  relationships 
between  contingent  values  and  other  variables,  including  other  contingent 
values,  are  consistent  with  principles  from  the  theory  of  value.   Consistency 
supports  interpreting  contingent  values  as  valid  estimates  of  true  values. 
Inconsistency  raises  doubts  about  such  an  interpretation. 
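
One simple way to picture a construct validity test is a scope check: theory suggests that a respondent's willingness to pay for complete cleanup should be at least as large as that respondent's willingness to pay for partial cleanup. The sketch below counts how often paired responses satisfy that ordering. The function name, the data layout, and the sample numbers are hypothetical and are not taken from the Clark Fork report.

```python
from typing import Sequence, Tuple


def share_consistent_with_scope(pairs: Sequence[Tuple[float, float]]) -> float:
    """Return the share of respondents whose WTP for complete cleanup
    is at least as large as their WTP for partial cleanup."""
    if not pairs:
        raise ValueError("need at least one response pair")
    consistent = sum(1 for partial, complete in pairs if complete >= partial)
    return consistent / len(pairs)


# Hypothetical (partial, complete) annual WTP responses in dollars.
example_pairs = [(10.0, 25.0), (0.0, 0.0), (50.0, 40.0), (5.0, 5.0)]
print(share_consistent_with_scope(example_pairs))  # 0.75
```

A high share is consistent with interpreting the responses as valid; a low share would raise doubts. An actual test would also have to account for sampling error.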

Progress  in  criterion  validity  research  requires  situations  where  the 
same  good,  service,  or  environmental  amenity  can  be  valued  using  contingent 
valuation  and  some  other  method  that  is  generally  accepted  to  be  at  least  as 
accurate  as  the  contingent  value.   Contingent  valuation  involves  transactions 
(broadly  defined  to  include  activities  like  voting  in  referenda  on  issues  of 
service provision and taxation) that are fundamentally hypothetical. Most
economists  would  agree  that  real  transactions  for  the  same  good,  service,  or 
amenity  should  yield  values  that  are  at  least  as  valid  as  contingent  values. 
Thus  in  criterion  validity  research,  results  from  real  transactions  of  some 
sort generally serve as the standard against which to evaluate the validity of
contingent  values. 

Content  and  construct  validity  tests  are  sometimes  referred  to  as 
internal  tests,  because  they  involve  evidence  from  within  a  study.   Criterion 
validity, on the other hand, involves external tests, in the sense that the
results from a criterion validity study are extended to studies
using only contingent valuation. If contingent valuation worked well when
evaluated  against  a  more  valid  measure  of  value  in  one  application,  then  this 
increases  confidence  that  it  will  work  well  in  other  applications. 

A  more  complete  exposition  of  the  problem  of  contingent  valuation 
validity  assessment  and  how  each  of  these  approaches  addresses  that  problem  is 
provided in Appendix A of this review. That appendix is a paper by Bishop, Champ,
Brown,  and  McCollum  presented  last  summer  at  an  international  conference  on 
contingent  valuation. 

As  the  paper  in  Appendix  A  points  out,  the  overall  validity  of  the 
contingent  valuation  method  is  still  a  matter  of  intense  debate.   This  debate 
continues  because  results  from  validity  assessments  have  so  far  been  mixed. 
Sometimes  contingent  valuation  seems  to  perform  well,  while  in  other 
applications  it  seems  to  perform  poorly.   One  conclusion  is  evident:   It  is 
possible,  indeed  easy,  to  perform  poor  contingent  valuation  studies.   As  the 
overall  validity  of  the  method  continues  to  be  debated,  increasing  attention 
is  being  focused  on  the  factors  that  distinguish  "good"  studies  from  "bad" 
studies.   Stated  differently,  the  debate  over  validity  is  focusing  more  and 
more  on  internal  tests,  tests  at  the  level  of  the  individual  study.   It  is 
clear  that  no  single  study,  or  even  a  small  number  of  studies,  will  be 
sufficient  to  judge  the  overall  validity  of  the  contingent  valuation  method. 
It  is  also  clear  that  too  many  weak  studies  have  come  forward  claiming  to  show 
that  contingent  valuation  in  general  does  or  does  not  work  well.   If 
contingent  valuation  has  the  potential  to  work  well,  it  is  most  likely  to  do 
so  in  studies  exhibiting  high  levels  of  content  and  construct  validity. 
Studies  that  are  poorly  designed  and  executed  (i.e.,  have  low  content 
validity)  and  perform  poorly  in  construct  validity  testing  are  least  likely  to 
produce  valid  estimates  of  true  values  and  are  least  relevant  in  achieving  the 
ultimate goal of defining those conditions where contingent valuation is the
most  (and  least)  promising. 

This  assessment  of  the  Clark  Fork  study  will  be  limited  to  content  and 
construct  validity.   In  particular,  little  attention  will  be  given  to 
criterion  validity.   A  criterion  validity  assessment  in  the  Clark  Fork  case 
would  involve  trying  to  argue  that,  because  contingent  valuation  worked  well 
(or  poorly)  in  some  specific  criterion  validity  study  having  nothing  directly 
to  do  with  the  Clark  Fork  application,  it  therefore  must  have  worked  well  (or 
poorly)  for  the  Clark  Fork  study.   Such  conclusions  are  not  possible  at  this 
point  in  time  for  two  reasons.   First,  if  contingent  valuation  consistently 
performed  well  (or  poorly)  in  criterion  validity  studies,  then  there  might  be 
a  basis  to  extend  to  the  Clark  Fork  study.   However,  so  far,  the  results  from 
such studies are mixed (relevant citations are given in Appendix A).
Sometimes  contingent  valuation  appears  to  work  fairly  well;  other  times  it 
does  not.   Second,  no  study,  or  set  of  studies,  is  sufficiently  analogous  to 
the  Clark  Fork  study  to  form  a  basis  for  judging  whether  contingent  valuation 
worked  well  there.   Because  of  the  inability  to  carry  out  a  criterion  validity 
assessment,  this  review  will  be  based  on  an  examination  of  the  procedures 
applied  in  the  Clark  Fork  study  and  tests  for  consistency  between  the  Clark 
Fork  results  and  economic  theory. 

In  the  next  section,  I  will  examine  the  content  validity  of  the  Clark 
Fork  study.   Then  I  will  turn  to  the  efforts  of  the  investigators  to  test 
theory-driven  hypotheses  about  their  results,  thus  testing  the  construct 
validity of their application. Though the terminology may vary, the criteria
I apply in evaluating the Clark Fork study overlap a great deal with the
recommendations of the NOAA Panel and the rules for contingent valuation
damage assessments proposed by NOAA and the Department of the Interior.

As  the  review  proceeds,  it  will  become  increasingly  clear  that  validity 
assessment  is,  to  a  large  extent,  a  matter  of  professional  judgement. 
Reviewers  of  contingent  valuation  studies  bring  their  training  and  experience 
to  bear  in  trying  to  judge  whether  value  estimates  are  interpretable  as 
adequate  estimates  of  true  values  given  the  current  state-of-the-art.   To  try 
to place my judgements about the Clark Fork study in perspective, the third
section  will  compare  my  evaluation  of  it  against  my  evaluation  of  two  other 
studies.   The  fourth  section  will  discuss  some  aspects  of  the  scientific 
status  of  the  contingent  valuation  method. 

Before  I  turn  to  the  content  validity  assessment,  the  boundaries  of  my 
review  need  to  be  explicitly  stated.   I  have  not  reviewed  the  completed 
questionnaires  or  the  data  files.   I  have  taken  at  face  value  the  results  of 
the  statistical  analyses  performed  as  they  are  described  in  the  report.   I 
have  not  attempted  to  replicate  the  statistical  procedures  followed  in 
arriving  at  those  results.  Also,  I  have  not  reviewed  the  various  reports 
covering  injuries  at  the  Clark  Fork  sites  or  attempted  to  compare  the  injuries 
described  in  the  contingent  valuation  surveys  with  the  injuries  described  in 
those  reports.   Finally,  my  review  stops  with  the  estimated  values  for  partial 
and  complete  cleanup  at  the  Clark  Fork  sites  and  will  not  include  procedures 
followed  in  the  estimation  of  damages  over  time. 


CONTENT  VALIDITY  ASSESSMENT 

Students  of  contingent  valuation  often  conclude  that  one  contingent 
valuation  study  is  better  than  another.   In  part,  such  a  judgment  would  be 
based  on  an  examination  of  how  the  studies  were  designed  and  executed.   In  an 
effort  to  make  the  criteria  that  are  applied  in  formulating  such  judgments 
more  explicit  and  systematic,  Bishop  and  McCollum,  in  a  draft  paper  presented 
in  Appendix  B  of  this  review,  developed  a  set  of  criteria  that  should  be 
applied  in  assessing  the  content  validity  of  contingent  valuation  studies. 
These  criteria  are  stated  in  the  form  of  questions  that  appear  as  Table  1  in 
the  paper.   My  co-author  and  I  are  proposing  that  reviewers  of  contingent 
valuation  studies  express  their  answers  to  the  questions  verbally  and,  except 
for  the  first  question,  in  terms  of  a  numerical  score.   The  scores  are  to 
express  the  extent  to  which  the  study  under  review  meets  the  criteria  in  each 
case.   As  I  evaluate  the  Clark  Fork  study  and  eventually  two  other  studies  as 
well,  I  will  first  present  my  answer  to  each  question  and  then  assign  a  score. 

1. Do the study procedures contain flaws so serious that they would rule out
the use of the results to achieve study goals?

While I have raised several questions about the procedures followed in
the study, I did not find any of the potential flaws to be fatal. No points
are assignable for this question. Either a study has fatal flaws or it does
not.

2.  Was  the  true  value  defined? 

The  theoretical  basis  for  this  study  is  provided  in  Section  1.2  of  the 
report  and  in  formal  definitions  of  the  values  estimated,  which  appear  on 
pages  5-23  and  5-24  of  the  report.   Equation  5-2  defines  willingness  to  pay  as 
the  income  change  required  to  exactly  offset  the  utility  gain  from  an 
improvement  in  the  environmental  amenities  of  the  Clark  Fork  NPL  sites.   This 
definition  of  value  causes  no  particular  difficulty  for  cases  in  which  a 
certain  and  rather  immediate  change  in  the  status  of  the  environmental 
resources  is  to  be  valued.   In  the  Clark  Fork  study,  the  formal  definition  of 
value  does  not  explicitly  account  for  the  uncertainty  or  the  timing  of  any 
changes  in  the  environmental  attributes  if  a  cleanup  is  undertaken. 
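
In standard textbook notation (an assumption here, since the report's Equation 5-2 is not reproduced in this review), the definition just described can be sketched with an indirect utility function $v$, income $y$, and environmental quality $q^{0}$ before and $q^{1}$ after the improvement:

```latex
v\left(q^{1},\; y - \mathrm{WTP}\right) \;=\; v\left(q^{0},\; y\right)
```

Written this way, the expression contains no term for the probability that the improvement occurs or for the date at which it occurs, which is the gap noted above.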

The  issue  of  uncertainty  may  not  play  a  critical  role  in  this  study.   As 
part  of  the  scenarios,  respondents  were  provided  with  descriptions  of  what 
would  be  accomplished  under  the  various  proposed  interventions.1   These 
effects  were  simply  described  without  any  sense  of  uncertainty  attached  to 
them.   Perhaps  of  more  concern  is  the  issue  of  timing.   Willingness-to-pay  is 
expressed in terms of an annual payment during each of the next ten years.
The  connection  between  the  time  path  of  costs  (the  annual  payments  for  ten 
years)  and  the  time  path  of  benefits  of  the  intervention  is  not  dealt  with 
theoretically. 
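
The timing concern can be made concrete with a present-value sketch. Assume, for illustration only, a constant annual payment $w$ made at the end of each of the ten years, a discount rate $r$, and a stream of benefits $b_{t}$ that begins in year $T$. A theoretical treatment would need to relate the two sides of an expression such as:

```latex
\sum_{t=1}^{10} \frac{w}{(1+r)^{t}} \;=\; \sum_{t=T}^{\infty} \frac{b_{t}}{(1+r)^{t}}
```

The symbols $r$, $T$, and $b_{t}$ are hypothetical labels introduced here only to show what a treatment of timing would have to specify.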

Another  theoretical  concern  focuses  on  the  definition  of  damages. 
Complete  cleanup  is  used  as  the  baseline  for  calculating  damages  and  damages 
are  taken  as  the  "residual  value,"  the  difference  between  the  value  of 
complete cleanup and the value of partial cleanup. The true baseline
condition of the resources is the condition that would have prevailed had the
injuries never occurred. Since complete cleanup takes time, one would expect
its value to be less than the value the resources would have generated had
they never been injured (the theoretical value of "true damages"). Therefore,
residual value as defined in the Clark Fork study (the value of complete
cleanup minus the value of partial cleanup) will be less than true damages
(the value of uninjured resources minus the value of partial cleanup), all
else equal. The report does not consider this issue nor does it point out
that residual value underestimates true damages, all else equal.

1 "Intervention" is used in this report to signify actions of government,
such as cleaning up NPL sites, that may generate values.
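
The residual-value argument above can be stated compactly. Let $V_{U}$, $V_{C}$, and $V_{P}$ denote the values of uninjured resources, complete cleanup, and partial cleanup, respectively (labels assumed here for illustration; they do not appear in the report). If $V_{C} < V_{U}$, then:

```latex
\underbrace{V_{C} - V_{P}}_{\text{residual value}} \;<\; \underbrace{V_{U} - V_{P}}_{\text{true damages}}
```

The size of the shortfall is $V_{U} - V_{C}$, the value lost because even a complete cleanup takes time.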

One  other  theoretical  aspect  of  the  damage  calculation  needs  to  be 
considered:   Apportionment  was  carried  out  to  allocate  estimated  total  values 
among  various  resource  categories.   This  apportionment  was  accomplished  using 
respondents'  reports  about  the  relative  percent  of  total  willingness  to  pay 
attributable  to  each  of  the  various  resource  categories.   Such  apportionment 
questions  are  not  only  difficult  for  respondents  to  deal  with,  but  also  pose 
some  theoretical  difficulties.   The  value  of  any  one  of  the  components  is 
theoretically  dependent  on  the  levels  of  the  others. 
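
A minimal sketch of share-based apportionment as described above, assuming a respondent reports a total willingness to pay together with percentage shares for each resource category. The category names and dollar figures are hypothetical, and the function is an illustration rather than the study's own procedure.

```python
from typing import Dict


def apportion(total_wtp: float, percent_shares: Dict[str, float]) -> Dict[str, float]:
    """Split a total WTP across categories using reported percentage shares."""
    share_sum = sum(percent_shares.values())
    if abs(share_sum - 100.0) > 1e-9:
        raise ValueError(f"shares sum to {share_sum}, expected 100")
    return {name: total_wtp * pct / 100.0 for name, pct in percent_shares.items()}


# Hypothetical respondent: $40 per year, split across three made-up categories.
print(apportion(40.0, {"aquatic": 50.0, "terrestrial": 30.0, "groundwater": 20.0}))
# {'aquatic': 20.0, 'terrestrial': 12.0, 'groundwater': 8.0}
```

The split is purely arithmetic: it treats each category's value as a fixed fraction of the total. That is where the theoretical difficulty arises, since the value of any one component depends on the levels of the others.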

In  view  of  these  theoretical  concerns,  I  assigned  3  out  of  the  possible 
5  points  on  this  item.   However,  I  would  quickly  add  that,  while  I  might  have 
some  theoretical  qualms  about  apportionment,  I  am  not  aware  of  any  methods 
that  are  both  practical  and  theoretically  sound.   There  is  precedent  in  the 
existing  literature  for  what  was  done  to  apportion  values  in  the  Clark  Fork 
study.   Although  questions  arise  from  a  theoretical  perspective,  the  authors 
appear  to  have  applied  a  reasonable  practical  expedient  to  sidestep  a 
theoretically  insoluble  problem. 

3.   Were  the  environmental  attributes  relevant  to  potential  subjects  fully 
identified? 

The Clark Fork study devoted substantial effort to qualitative research,
including both verbal protocols and self-administered pretests. Although the
researchers used different terms, what they did went a substantial way toward
identifying participant-relevant attributes of the resources in question.
They speak (p. 2-16) in terms of "a process of acquiring a substantial amount
of potentially relevant information about natural resource injuries, both in
general and at the Clark Fork sites in specific, and formally paring down the
information  to  retain  the  most  critical  information  that  respondents  use  to 
evaluate  the  sites  and  to  form  values."   In  the  process  of  attempting  to  pare 
down  the  amount  of  information  presented  in  the  survey  instrument,  they 
presumably  would  have  learned  a  great  deal  about  which  resources  respondents 
consider  relevant. 

However,  I  still  assigned  only  7  points  out  of  a  possible  10  to  this 
aspect.   The  qualitative  research  performed  in  designing  this  study  supports 
the use of the information that was provided (see Table C-1, for example).
Nevertheless,  it  is  not  difficult  to  think  of  examples  of  potentially  relevant 
attributes  that  apparently  were  not  considered.   As  one  example,  the  final 
survey  instrument  stated  that  trout  have  been  eliminated  from  Silver  Bow  Creek 
and  that  trout  populations  in  the  Clark  Fork  River  between  Warm  Springs  Ponds 
and Milltown Reservoir have been reduced to about one-fourth of the population
that  would  be  present  if  the  contamination  had  not  occurred.   The  survey  did 
not  provide  any  information  about  the  numbers  of  trout  that  this  might 
represent,  either  in  absolute  terms  or  relative  to  other  fish  habitats  in 
Montana.

Similarly,  when  describing  terrestrial  impacts,  the  final  survey  simply 
described  the  areal  extent  of  affected  habitat.   For  example,  regarding 
contaminated  soils,  respondents  were  informed  of  a  loss  of  20  square  miles  of 
habitat used by elk, deer, pine martens, grouse, redtail hawks, and common
songbirds.   It  did  not  provide  information  about  whether  this  has  resulted  in 
any  reductions  in  wildlife  populations  that  might  utilize  this  habitat  nor  did 
it  place  the  loss  of  wildlife  (if  any)  into  any  sort  of  context. 

The  focus  on  areas  of  habitat  rather  than  wildlife  populations  may  be  of 
no  consequence.   It  might  have  been  established  during  the  qualitative 
research  that  habitat,  rather  than  wildlife  populations,  was  the  critical 
piece  of  information  required  by  respondents.   Similarly,  respondents  might 
have  cared  about  wildlife  populations  and  possessed  sufficient  knowledge  about 
the  consequences  of  the  loss  of  habitat  for  the  species  involved.   In  either 
instance,  the  lack  of  detail  about  wildlife  populations  would  not  be  a 
significant issue. But leaving such loose ends reduces the content validity
of  the  study. 

In  a  similar  vein,  there  was  no  explicit  discussion  of  the  number  of 
humans  who  are  affected  by  the  contamination  of  groundwater  or  the  nature  and 
extent  of  the  adverse  effects.   The  survey  instrument  indirectly  indicated 
that  the  water  supply  for  the  city  of  Butte  has  been  contaminated  and  has  had 
to be replaced. It specifically mentioned that the residents of Milltown have
also had to replace their water supply because of groundwater contamination.
However, the respondent was left to guess at the extent of the effects now and
in the future.

Arguing  along  the  same  lines  as  for  wildlife,  the  lack  of  detail  about 
the  effects  of  groundwater  contamination  may  be  irrelevant.   Respondents  may 
simply  have  valued  groundwater  resources  regardless  of  whether  or  not  those 
resources  are  used  as  a  water  supply.   Furthermore,  even  if  the  human-use 
component  of  groundwater  was  important  to  willingness  to  pay,  respondents  to 
the  survey  might  have  been  able  to  guess  the  number  of  affected  households 
with  sufficient  precision  to  support  valuation.   However,  the  study  would  be 
stronger  if  these  aspects  had  been  examined. 

4.   Were  the  potential  effects  of  the  intervention  on  environmental  attributes 
and  other  economic  parameters  adequately  documented  and  communicated? 

The  report  on  the  Clark  Fork  study  points  out  that  the  researchers 
worked  with  those  assessing  the  injury  to  document  how  attributes  were 
affected  by  the  releases.   This  effort  should  have  been  adequate  to  document 
such  effects. 

However,  regarding  the  communication  of  the  effects,  the  instrument 
required  that  respondents  read  and  absorb  large  blocks  of  information.   This 
could  have  left  some  respondents  less  than  well  informed  simply  because  they 
lacked  the  ability  to  absorb  the  information  provided  and  therefore  could  not 
use  it  in  arriving  at  their  values.   This  is  a  sufficiently  serious  concern  in 
my  view  to  assign  only  6  out  of  10  points  on  this  item. 

5. Were respondents aware of the existence and status of environmental
substitutes? 

The  survey  instrument  did  not  explicitly  remind  respondents  to  consider 
the  levels  of  either  substitutes  for,  or  complements  to,  the  interventions 
being  valued.   On  the  other  hand,  the  instrument  did  provide  many  indirect 
references  to  potential  substitutes.   For  example,  it  identified  and  briefly 
described  all  of  the  National  Priority  List  (NPL)  sites  in  the  state  of 
Montana  and  the  total  number  of  other  sites  at  which  contamination  may  have 
occurred.   In  addition  to  identifying  some  possible  environmental  substitutes, 
the  survey  instrument  also  asked  survey  respondents  to  rank  the  importance  of 
several  problems  facing  society  at  large.   The  positioning  of  this  question 
prior  to  the  valuation  question  may  well  have  served  to  remind  respondents  of 
a  broader  range  of  social  problems  that  might  require  substantial  resources  to 
solve.

Finally,  it  is  helpful  to  remember  at  least  part  of  the  motivation  for 
being  concerned  about  substitutes  in  damage  assessments.   The  fear  on  the  part 
of  the  NOAA  Panel,  as  well  as  others  who  have  critically  reviewed  the 
contingent valuation method, was that respondents would get the impression that
the injuries are more widespread and severe than they really are. For
example,  in  evaluating  an  oil  spill,  respondents  might  get  the  idea  that  the 
spill  affected  vast  areas  and  killed  a  large  proportion  of  the  bird  and  other 
marine  life  of  the  region  when  this  is,  in  fact,  not  the  case.   If  so, 
respondents  would  need  to  be  reminded  that  large  areas  of  the  coast  and 
associated  wildlife  populations  were  unaffected.   The  Clark  Fork  study  may 
have  protected  itself  from  this  sort  of  possibility  by  limiting  its  sampling 
frame  to  Montana.   One  would  suspect  that  residents  of  Montana  are  generally 
aware  that  very  large  areas  of  their  state  are  relatively  pristine  and  have 
not  been  affected  by  releases  of  toxic  substances.   Most  Montanans,  for 
example,  would  probably  realize  that  the  20  square  miles  of  wildlife  habitat 
identified  in  the  survey  as  being  affected  by  contamination  from  the  Clark 
Fork  NPL  sites  represents  only  a  very  small  portion  of  the  total  wildlife 
habitat  available  in  western  Montana. 

This  is  not  to  say  that  the  researchers  did  all  they  might  have  done 
about  substitutes.   For  example,  enhanced  public  access  to  other  rivers  might 
serve as a substitute for cleaning up the aquatic resources in Silver Bow
Creek.   Perhaps  the  survey  instruments  could  have  done  a  bit  more  to  remind 
survey  respondents  about  a  broader  range  of  environmental  substitutes. 
Whether  such  reminders  would  have  significantly  affected  values  remains  sheer 
speculation.   Given  that  Montanans  are  likely  to  realize  the  affected 
resources  represent  only  a  small  part  of  Montana  resources,  I  would  judge  that 
this  study's  efforts  to  remind  subjects  about  substitutes  were  quite 
satisfactory  and  warrant  assigning  it  4  out  of  the  5  possible  points  on  this 
item. 

6.   Was  the  budget  constraint  adequately  stressed? 

Individuals  were  not  explicitly  reminded  of  their  budget  constraints  in 
the  survey  instruments.   However,  I  do  not  view  this  as  a  major  flaw.   The 
discussion  of  additional  cleanup  sites  (for  example,  the  NPL  sites  not  in  the 
Clark  Fork  Basin)  and  the  discussion  of  other  non-environmental  programs  must 
have  served  to  heighten  respondents'  awareness  of  the  fact  that  there  are  many 
competing  demands  for  limited  resources.   Likewise,  the  second  question  in  the 
survey  asked  respondents  to  rank  the  relative  importance  of  dealing  with  a 
series  of  problems.   While  the  researchers  indicated  that  this  question  was 
inserted at the beginning of the survey in an attempt to "...diffuse any
importance  bias  to  a  specific  topic  that  may  result  by  receiving  a  survey  on  a 
specific  topic...,"  the  question  may  also  have  served  to  remind  respondents 
about  other  social  issues  that  could  require  additional  resources  if 
addressed. 

In  my  own  work,  I  have  found  that  respondents  often  spontaneously 
identify  budget  constraints,  or  income,  when  asked  how  they  determined  their 
responses  to  a  willingness-to-pay  question.   In  the  Clark  Fork  study, 
respondents  were  encouraged  to  provide  written  comments  and  nearly  75  percent 
of  them  did  so.   The  report  stated  (p.  5-18),  "The  most  prevalent  type  of 
comment  (258  made)  was  simply  an  affirmation  of  the  WTP  response.   These 
comments  frequently  referenced  the  respondent's  income  level  as  a  determinant 
of WTP."

I assigned the Clark Fork study 4 out of 5 points for adequately
stressing budget constraints.

7.   Was  the  context  for  valuation  fully  specified  and  incentive  compatible? 

The  context  for  valuation  in  this  study  seems  to  have  been  generally 
adequate.   Respondents  were  asked  to  indicate,  using  a  payment  card,  the 
maximum  amount  they  would  be  willing  to  pay  each  year  for  the  next  ten  years. 
The  payment  would  be  used  to  finance  the  cleanup  of  the  Clark  Fork  NPL  sites. 
Respondents  were  informed  that  all  members  of  society  would  have  to  pay  part 
of  the  cost  of  cleaning  up  these  sites,  and  that  private  industry  and  agencies 
of  the  U.S.  Government  were  already  paying  for  part  of  the  cleanup. 
Mentioning  payments  by  entities  who  caused  the  contamination  is  of  particular 
importance.   In  my  ongoing  work,  I  have  seen  respondents  to  contingent 
valuation  questions  tend  to  report  lower  WTP  values  when  they  believed  that 
the  entities  causing  an  environmental  problem  were  not  being  required  to  share 
the  cost  of  fixing  the  problem. 

An  unusual  feature  of  the  context  in  this  study  was  the  use  of  multiple 
payment  vehicles.   Just  prior  to  the  valuation  question,  survey  respondents 
were  asked  to  evaluate  the  acceptability  of  various  payment  vehicles  that 
could  be  used  to  collect  the  money  required  for  cleanup.   Respondents  were 
asked  to  assume  that  their  stated  willingness  to  pay  would  be  collected  using 
one  or  more  of  the  vehicles  identified  as  acceptable  by  the  respondent  in  this 
earlier  question.   This  approach  was  taken  in  order  to  reduce  protest  bids 
emanating  from  undesirable  payment  vehicles  and  may  indeed  have  done  so. 

A  second  unusual  feature  of  the  context  was  that  respondents  were 
informed  that,  "If  cleanup  efforts  cost  less  than  people  are  willing  to  pay, 
the  fees  would  be  lowered  so  that  everyone  would  pay  only  a  share  of  what  the 
cleanup  actually  costs."   This  feature  of  the  context  may  also  have  served  to 
reduce  protest  bids.   Respondents  sometimes  express  skepticism  about  whether 
money  committed  to  environmental  remediation  will  actually  be  spent  on  it. 
For  example,  during  the  verbal  protocols  in  the  Clark  Fork  study,  one  subject 
said,  "Well,  I  think  if  they  would  assure  me  as  a  Montana  resident  that  my 
money  would  only  go  to  the  cleanup  of  the  Clark  Fork  Site  and  that  somehow  we 
could manage to keep the bureaucrats away from the money, so that they don't
hire  their  own  assistants  for  their  assistants,  I  would  say  that  I  wouldn't 
mind  an  increase  in  my  state  or  property  or  gasoline  tax..." 

Respondents  may  also  sometimes  try  to  calculate  the  amount  of  money  that 
would  be  raised  if  a  specified  amount  was  collected  from  all  citizens.   A  hint 
of  this  perspective  was  also  revealed  in  the  verbal  protocols  when  one 
respondent  said,  "If  the  method  was  more  on  fishing  and  hunting  licenses,  I 
think  $3  would  add  up  to  be  a  lot." 

Potential  biases  associated  with  either  of  these  points  of  view  may  have 
been  reduced  by  specifying  that  fees  would  be  lowered  if  the  cost  of  the 
cleanup  proved  to  be  less  than  the  money  that  was  collected.   Thus,  treatments 
of  the  payment  vehicle  and  surplus  funds  enhanced  the  validity  of  the  results. 

On  the  other  hand,  some  reservations  about  the  context  also  come  to 
mind.   The  timing  of  the  environmental  improvement  was  not  explicitly  stated 
in  the  contingent  valuation  scenario.   At  least  for  potential  users,  this 
could  have  been  an  important  ambiguity.   Also,  the  expressions  of  willingness 
to  pay  were  collected  using  a  payment  card  instead  of  the  referendum  format. 
The  referendum  format  is  favored  by  the  NOAA  Panel  and  many  current 
practitioners  because  of  the  simplicity,  plausibility,  and  incentive 
compatibility  of  referenda  in  many  situations.   Claims  for  incentive 
compatibility  would  be  difficult  to  prove  for  the  format  chosen  in  the  Clark 
Fork  study.   While  the  investigators  claim  that  any  incentives  for 
misstatement  of  values  would  result  in  a  downward  bias,  they  do  not  explore 
this  issue  in  much  depth. 

Weighing  the  rather  strong  features  against  the  potential  faults  just
noted,  a  rating  of  3  out  of  a  possible  5  points  was  assigned  for  the  context.

8.   Does  the  CV  question  elicit  willingness  to  pay? 

Clearly  the  survey  elicited  willingness  to  pay.   Thus,  the  full  5  points
for  this  item  were  assigned.

9.   Did  survey  respondents  accept  the  scenario?

This  is  a  difficult  issue  to  assess.   The  investigators  clearly  made  an 
effort  to  design  a  scenario  that  was  plausible  to  the  respondents.   Allowing 
the  respondent  to  choose  an  acceptable  payment  vehicle  should  also  have 
increased  the  acceptance  of  the  scenario.   One  bit  of  qualitative  evidence  of 
scenario  acceptance  is  the  result  that  only  46  out  of  the  total  of  841 
responses  (5.5  percent)  were  clearly  identified  as  protest  zeroes  and  only  21 
(2.5  percent)  were  identified  as  outliers.   This  is  evidence  that  a  large 
percentage  of  the  respondents  accepted  the  scenario. 

On  the  other  hand,  the  survey  instrument  contained  a  question  asking
whether  respondents  considered  themselves  personally  responsible  for
paying  part  of  the  cost  of  cleanup.   More  than  half  of  the  respondents  in  the
cleaned  data  set  (Table  5-22)  felt  little  or  no  responsibility  for  paying  for 
cleanup  (i.e.,  they  chose  1  or  2  on  a  seven-point  scale,  where  1  represented 
"not  at  all  responsible").  This  is  symptomatic  of  scenario  rejection  at  some 
level.   However,  the  authors  controlled  statistically  for  the  resulting  bias 
in  arriving  at  their  final  value  estimates. 

Weighing  all  these  considerations,  I  assigned  this  study  4  points  out  of 
5  for  scenario  acceptance. 

10.   Did  survey  respondents  believe  the  scenario? 

Here,  I  turn  to  whether  the  respondents  believed  that  their  responses 
would  actually  affect  both  the  flows  of  environmental  services  from  the 
resources  at  issue  at  the  Clark  Fork  sites  and  what  they  would  actually  pay. 
At  least  half  of  the  respondents  claimed  to  be  familiar  with  one  or  more  of 
the  Clark  Fork  sites.   The  investigators  built  on  this  awareness  by  explaining 
in  the  survey  that  "...  to  make  decisions  about  cleanup  programs  that  could 
cost  you  money,  it  is  important  to  know  how  much  it  is  worth  to  you  to  clean 
up  the  Clark  Fork  River  basin."   The  cover  letter  stated  that  study  results 
would  be  considered  when  future  decisions  about  potential  cleanups  were  made. 
These  steps  helped  to  make  the  scenario  believable. 

On  the  other  hand,  the  survey  instrument  did  not  make  clear  how  the 
survey  responses  would  be  translated  into  decisions  about  Clark  Fork
resources.   Furthermore,  the  survey  did  not  contain  any  questions  asking  if
the  respondents  felt  that  actual  decisions  regarding  cleanups  would  be  based 
on  the  results  of  the  survey.   The  flexibility  of  the  payment  vehicle,  which 
had  the  advantages  already  noted,  had  the  disadvantage  of  making  the  whole 
exercise  seem  more  hypothetical.   It  is  difficult  to  determine  whether  survey
respondents  really  believed  that  responses  to  the  survey  would  affect 
decisions  about  the  level  of  cleanup  and  their  own  expenditure  patterns. 

I  assigned  3  out  of  a  possible  5  points  for  the  believability  of  the 
scenario.

11.  How  adequate  and  complete  were  survey  questions  other  than  those  designed 
to  elicit  values? 

I  found  the  non-valuation  questions  to  be  generally  adequate.  Many 
questions  were  asked  that  could  be  used  in  assessing  the  construct  validity  of 
the  study.  These  included  questions  related  to  potential  uses  of  the  affected 
resources  (Questions  23,  24  and  25),  importance  of  cleaning  up  NPL  sites  in 
Montana  (Questions  15  through  22),  and  the  demographic  questions  (Questions  35 
through  44).  Respondents  were  provided  with  an  opportunity  to  provide  written
comments  about  their  reactions  to  the  valuation  question. 

A  total  of  4  out  of  5  points  were  assigned  here.   The  full  5  points  were 
not  assigned  because  other  questions  might  have  been  asked.   For  example, 
while  the  study  did  ask  about  environmentally  related  behavior  (recycling, 
organization  membership,  etc.)  I  would  have  been  tempted  to  add  some  standard 
questions  focusing  on  environmental  attitudes. 

12.  Was  the  survey  mode  appropriate?

Contrary  to  the  conclusions  of  the  NOAA  Panel,  I  believe  that  mail
surveys  can  be  constructed  to  adequately  carry  out  a  contingent  valuation 
study  for  natural  resource  damage  assessment.   Nevertheless,  in  this 
particular  case,  personal  interviews  might  have  been  a  better  choice.   A  large 
amount  of  information  was  provided  to  respondents,  perhaps  approaching  or  even 
exceeding  the  limits  of  what  can  be  reasonably  accomplished  using  a  mail 
survey.   This  could  have  been  a  partial  contributor  to  the  apparent  failure  of 
a  full  between-sample  scope  test,  for  example.   (I  will  return  to  the  scope
issue  in  addressing  the  construct  validity  later  on.)   Personal  interviews
might  have  resulted  in  a  higher  degree  of  confidence  that  information  in  the 
survey  instrument  was  successfully  communicated  to  survey  respondents.   I 
should  note  in  passing  that,  although  personal  interviews  might  have  been 
preferable  from  a  scientific  standpoint  in  this  case,  adequate  personal 
interviews  would  have  cost  hundreds  of  dollars  more  per  survey.   It  is  not 
surprising  that  I  and  others  do  mostly  mail  surveys  in  contingent  valuation 
studies.   Whether  the  potentially  improved  communication  would  have  been  worth 
the  extra  cost  is  an  open  question. 

Using  an  incentive  as  high  as  $20  has  not,  to  my  knowledge,  been  done 
previously  in  a  contingent  valuation  survey.   In  addition  to  its  likely 
contribution  toward  a  higher  response  rate,  it  is  plausible  that  the  $20 
encouraged  many  respondents  to  read  the  material  and  respond  carefully, 
counterbalancing  some  of  the  potential  adverse  effects  of  using  a  mail  survey.
What  other  effects  the  incentive  might  have  had  are  not  known.   From  a 
theoretical  perspective,  $20  is  small  relative  to  income  levels  of  respondents 
and  the  amount  paid  to  respondents  was  independent  of  the  values  they 
expressed  in  the  contingent  valuation  exercise.   Consequently,  one  would 
hypothesize  a  negligible  effect  of  the  incentive  on  results,  but  to  my 
knowledge  this  hypothesis  has  not  been  tested.   Given  the  high  costs  of 
personal  interviews,  large  incentives  for  mail  responses  may  be  a  cost- 
effective  substitute  and  should  be  researched. 

Concerns  about  potential  problems  exacerbated  by  the  mail  survey  are 
sufficiently  strong  to  assign  only  6  points  out  of  10  for  survey  mode. 

13.   Were  qualitative  research  procedures,  pretests,  and  pilots  sufficient  to 
find  and  remedy  identifiable  flaws  in  the  instrument  and  associated  materials? 

The  investigators  carried  out  three  separate  rounds  of  qualitative 
research,  which  included  one  set  of  verbal  protocols  and  two  pretests. 
Participants  in  the  pretests  were  asked  to  fill  out  early  drafts  of  survey 
instruments.   After  filling  the  survey  out,  they  were  engaged  in  a  de-briefing 
process.

Determining  whether  this  was  a  sufficient  amount  of  preliminary  research 
is  difficult.   At  the  conclusion  of  the  second  pretest,  the  preliminary
results  appeared  very  favorable  in  terms  of  relative  values  of  mean
willingness  to  pay.   In  particular,  the  prospects  for  a  cross-sample  scope 
test  appeared  favorable.   Furthermore,  estimates  of  mean  willingness  to  pay 
were  nearly  identical  regardless  of  whether  the  complete  cleanup  was  valued 
first  or  second.   It  might  have  been  prudent  to  perform  a  more  extensive  pilot 
test  that  would  have  involved  larger  sample  sizes  to  determine  whether  the 
results  obtained  in  pretests,  which  had  relatively  small  sample  sizes  per 
survey  version,  would  be  replicated  in  a  true  mail  survey  format. 
Furthermore,  a  more  extensive  pilot  test  could  have  revealed  whether  responses 
from  the  pretests  were  an  artifact  of  the  locations  (Helena  and  Missoula)  in 
which  they  were  conducted.   However,  it  is  only  fair  to  add  that,  while  some 
of  my  concern  has  focused  on  the  apparent  failure  of  this  study  in  a  cross- 
sample  scope  test,  the  idea  of  a  scope  test  had  not  been  formally  defined  at 
the  time  this  research  was  designed. 

I  assigned  this  study  4  points  out  of  a  possible  5  for  this  item. 

14.   Given  study  objectives,  how  adequate  were  procedures  employed  to  choose
study  subjects,  assign  them  to  treatments  (if  applicable),  and  encourage  high
response  rates?

The  sample  used  for  this  study  was  obtained  from  a  highly  regarded 
commercial  vendor,  Survey  Sampling  Inc.   The  sample  frame  was  Montana
telephone  listings.   Samples  of  this  type  are  always  subject  to  problems  of 
noncoverage  because  some  households  have  no  telephone  and  others  have  unlisted 
numbers.   In  the  case  of  the  Clark  Fork  study,  noncoverage  is  not  a  major 
problem  for  two  reasons.   First,  the  study  reported  that  approximately  83 
percent  of  Montana  households  have  listed  telephones.   There  are  certainly 
many  precedents  in  the  contingent  valuation  literature  for  sampling  coverage 
at  this  level.   I  would  judge  the  level  of  coverage  to  be  acceptable  given 
that  coverage  at  higher  levels  of  accuracy  would  have  been  difficult  and 
expensive  to  obtain.   Second,  econometric  procedures  were  used  to  adjust  mean 
value  estimates  to  account  for  both  noncoverage  and  nonresponse.

The  response  rate  in  this  study  was  relatively  high,  particularly  in 
light  of  the  large  amount  of  information  respondents  were  expected  to  read  and
consider  when  answering  the  questions.   Nonresponse  bias  does  not  appear  to  be
an  issue,  particularly  after  applying  the  procedure  just  mentioned. 

Based  on  the  conclusion  that  the  study  is  basically  sound  with  regard  to
sampling  and  survey  procedures,  combined  with  minor  concerns  about  sampling
noncoverage,  I  assigned  4  points  out  of  a  possible  5.

15.  Was  the  econometric  analysis  adequate? 

The  econometric  procedures  applied  were  sound.   The  researchers  were 
able  to  use  standard  regression  techniques  to  define  statistically  significant 
valuation  equations.   The  valuation  equations  included  several  variables  that 
supported  the  construct  validity  of  the  study,  as  will  be  discussed  shortly. 
Furthermore,  the  econometric  analysis  was  used  to  carry  out  adjustments  to 
willingness  to  pay  to  account  for  potential  biases  associated  with 
noncoverage,  nonresponse,  and  feelings  of  responsibility.   The  procedures  used 
to  identify  protest  zeroes  and  outliers  made  sense.   While  the  procedures  that 
were  used  for  sorting  zeroes  and  outliers  would,  all  else  equal,  introduce  a 
downward  bias  into  the  values  for  complete  and  partial  cleanup,  the  impact  on 
residual  values  was  small. 

I  assigned  this  study  4  points  out  of  a  possible  5  for  econometric 
procedures,  reflecting  that  more  analysis  might  have  been  done. 

16.  How  adequate  are  the  written  materials  from  the  study? 

Implementation  procedures,  response  rates,  analytical  procedures,  and 
results  were  all  covered  fairly  well  in  the  written  report.   I  would  have 
liked  more  details  on  some  issues  such  as  design  procedures.   I  assigned  the 
study  4  points  out  of  5  for  reporting. 

17.  Are  there  other  concerns  relating  to  the  design  and  execution  of  the 
study  that  have  not  already  been  addressed? 

I  have  only  one  other  concern  to  raise.   I  would  not  have  followed  the
procedure  used  in  this  study  to  address  the  embedding  issue.   I  question
whether  embedding  is  a  real  problem  in  studies  like  this  one.   The  most  widely
cited  paper  on  embedding  is  the  one  by  Kahneman  and  Knetsch,  but  as  McCollum
and  I  explain  in  the  paper  in  Appendix  B,  we  believe  that  that  study  is
fatally  flawed  in  terms  of  the  procedures  followed.   I  tend  to  think  that
embedding  is  not  a  very  powerful  influence  in  studies  that  are  as  specific  in 
their  scenarios  as  the  Clark  Fork  study. 

If  embedding  is  not  really  a  problem,  then  Survey  Question  30,  asking 
respondents  to  "disembed"  their  previously  stated  values,  is  unnecessary. 
Using  the  responses  to  Survey  Question  30  to  reduce  the  estimated  values  of 
both  cleanup  options,  as  is  done  in  the  Clark  Fork  study,  introduces  a 
downward  bias  into  the  resulting  value  estimates.   Because  of  this  concern,  I 
have  deducted  2  points  from  the  possible  10  under  Question  17  of  the  rating 
form.   The  relatively  light  deduction  is  intended  to  express  my  desire  not  to 
be  overly  doctrinaire  on  this  point.   The  presence  or  absence  of  embedding  in 
responses  to  contingent  valuation  questions  is  an  issue  that  is  still  being 
debated  by  researchers. 

Summary 

Results  of  my  content  validity  assessment  are  summarized  in  Table  1. 
The  Clark  Fork  study  did  quite  well,  earning  80  percent  or  more  of  total 
points  on  most  items.   More  serious  concerns  arose  regarding  the  adequacy  of 
attention  to  theory,  the  choice  of  a  mail  survey  mode,  the  adequacy  of  the 
procedures  to  identify  respondent-relevant  attributes,  to  communicate  the 
effects  of  the  intervention,  and  to  specify  the  context  for  valuation.  Even 
where  such  concerns  existed,  however,  scores  equal  to  at  least  60  percent  of 
total  points  were  earned.   A  score  of  3  out  of  a  possible  5  points  was 
assigned  to  the  believability  of  the  scenario,  but  more  believable  scenarios 
are  not  all  that  common.   The  total  score  of  73  out  of  a  possible  100  points 
signifies  a  relatively  strong  study.   After  considering  construct  validity,  I 
will  further  examine  this  conclusion  by  rating  two  other  studies  and  comparing 
scores.

Table  1:   Content  Validity  Scores  for  the  Clark  Fork  Study

Questions (See Appendix B for further explanation;          Score/
questions taken from Table 1 of Appendix B)                 Possible Points

 1.  Do the study procedures contain flaws so serious       No.
     that they would rule out the use of the results to
     achieve study goals?

 2.  Was the true value defined?                            3/5

 3.  Were the environmental attributes relevant to          7/10
     potential subjects fully identified?

 4.  Were the potential effects of intervention on          6/10
     environmental attributes and other economic parameters
     adequately documented and communicated?

 5.  Were respondents aware of the existence and status     4/5
     of environmental substitutes?

 6.  Was the budget constraint adequately stressed?         4/5

 7.  Was the context for valuation fully specified and      3/5
     incentive-compatible?

 8.  Does the CV question elicit willingness to pay?        5/5

 9.  Did survey respondents accept the scenario?            4/5

10.  Did the survey respondents believe the scenario?       3/5

11.  How adequate and complete were survey questions        4/5
     other than those designed to elicit values?

12.  Was the survey mode appropriate?                       6/10

13.  Were qualitative research procedures, pretests, and    4/5
     pilots sufficient to find and remedy identifiable
     flaws in the instrument and associated materials?

14.  Given study objectives, how adequate were              4/5
     procedures employed to choose study subjects, assign
     them to treatments (if applicable), and encourage high
     response rates?

15.  Was the econometric analysis adequate?                 4/5

16.  How adequate are the written materials from the        4/5
     study?

17.  Are there other concerns relating to the design and    8/10
     execution of the study that have not already been
     addressed?

Total Score                                                 73/100
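
As an illustrative check, and not as part of the original rating procedure, the arithmetic in Table 1 can be tallied with a few lines of Python. The dictionary layout and variable names are assumptions made for this sketch; the scores themselves are transcribed from the table.

```python
# Scores transcribed from Table 1: item number -> (score, possible points).
# Item 1 is a yes/no screen and carries no points, so it is omitted.
scores = {
    2: (3, 5), 3: (7, 10), 4: (6, 10), 5: (4, 5), 6: (4, 5), 7: (3, 5),
    8: (5, 5), 9: (4, 5), 10: (3, 5), 11: (4, 5), 12: (6, 10), 13: (4, 5),
    14: (4, 5), 15: (4, 5), 16: (4, 5), 17: (8, 10),
}

total = sum(s for s, _ in scores.values())
possible = sum(p for _, p in scores.values())

# Share of possible points earned on each item, as a percentage.
share = {item: 100 * s / p for item, (s, p) in scores.items()}
strong = sorted(i for i, pct in share.items() if pct >= 80)
weaker = sorted(i for i, pct in share.items() if pct < 80)

print(f"Total: {total}/{possible}")
print("Items at 80 percent or more:", strong)
print("Items below 80 percent:", weaker)
print("Lowest share (percent):", min(share.values()))
```

The tally is only a bookkeeping aid. The weights (5 or 10 possible points per item) come from the rating form and are taken as given here.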

CONSTRUCT  VALIDITY  ASSESSMENT

As  explained  in  Appendix  A,  construct  validity  testing  of  contingent 
non-use  value  studies  depends  primarily  on  the  statistical  testing  of 
hypotheses  about  relationships  between  answers  to  the  valuation  question  and 
other  variables  as  predicted  from  theory.   For  example,  it  is  very  common  to 
test  the  relationship  between  value  expressions  and  income.   Passing  construct 
validity  tests  is  evidence  that  the  processes  depicted  by  economic  theory  are 
at  work  in  the  minds  of  respondents  as  they  answer  contingent  valuation 
questions.   This  in  turn  supports  interpreting  responses  as  estimates  of  true 
values,  as  defined  in  theory. 

The  paper  in  Appendix  A  stresses  that  failure  to  pass  any  construct 
validity  test  that  a  researcher  or  reviewer  might  invent  should  not  be 
considered  fatal.   In  fact,  econometric  studies  utilizing  market  data  often 
fail  theoretically  motivated  tests,  yet  are  considered  useful.   Paris  et  al.
(1993)  recently  pointed  out  that  such  failures  are  common  in  research  on
consumption  and  production  of  agricultural  commodities,  one  of  the  most
intensively  studied  sectors  of  the  U.S.  economy.  They  go  on  to  point  out
(p.  37),  "In  spite  of  such  unfavorable  results,  theorists  and  practitioners
continue  to  derive  policy  conclusions  from  a  theory  that  appears  to  be  largely
refuted."   How  can  this  be  so?   They  add  (p.  38),  "Frequent
empirical  refutation  of  the  logical  foundations  of  traditional  consumer  and 
producer  theories  reveals  (at  least  in  part)  that  these  theories  may  not  be  as 
general  and  flexible  as  they  could  be."   In  other  words,  economic  theory  has 
substantial  limitations  in  predicting  and  explaining  behavior.   They  might 
also  have  mentioned  that  econometric  methods  have  their  limitations  as  well. 
Relationships  predicted  by  theory  may  be  present  in  the  data,  but  the 
econometric  techniques  to  identify  them  may  not  yet  be  available.   Progress 
comes  as  these  theoretical  and  empirical  limitations  are  overcome.   In  the 
meantime,  and  despite  imperfections,  useful  results  are  often  obtained. 

Inconsistencies  with  economic  theory  occur  despite  the  fact  that  market 
data  have  high  credibility  among  economists  as  indicators  of  true  values. 
Surely,  alternative,  higher  standards  should  not  be  applied  to  contingent
valuation.   Contingent  valuation  data,  like  market  data,  could  be  useful  for
policy  analysis  and  damage  assessment  despite  some  failures  to  pass  construct 
validity  tests.   Still,  the  more  often  construct  validity  tests  are  passed, 
the  more  confident  are  researchers  and  reviewers  that  economic  studies  are 
measuring  what  they  were  designed  to  measure. 

Economic  theory  presents  a  great  many  possible  hypotheses  that  could  be 
tested.   These  range  from  the  general  observation  that  people  with  strong 
preferences  for  the  outcome  of  an  environmental  intervention  should  express 
higher  willingness-to-pay  values  to  the  notion  that  higher  incomes,  all  else 
equal,  might  be  expected  to  result  in  higher  willingness  to  pay.   Such 
hypotheses  can  be  tested  using  regression  techniques  in  which  willingness  to 
pay  is  predicted  by  variables  measuring  reported  behaviors,  attitudes,  income, 
and  various  socioeconomic  characteristics.   In  the  paper  in  Appendix  A,  we 
refer  to  such  tests  as  "rudimentary  tests."   We  also  consider  more  "advanced 
tests,"  tests  that  involve  comparisons  of  two  or  more  contingent  values. 
Scope  tests  are  examples  of  advanced  tests.   Scope  tests  investigate  the 
degree  to  which  willingness  to  pay  is  systematically  related  to  the  dimensions 
of  the  environmental  intervention.   One  would  expect,  all  else  equal,  that 
willingness  to  pay  would  be  larger  for  an  intervention  that  results  in  either 
higher  quality  or  quantity  of  environmental  attributes  than  for  an 
intervention  resulting  in  lower  quality  or  quantity  of  environmental 
attributes.

Advanced  tests  are  not  limited  to  scope  tests.   For  example,  one  could 
consider  an  environmental  intervention  that  results  in  a  change  in  the  quality 
of  two  environmental  attributes,  A  and  B.   Further  suppose  that  attributes  A 
and  B  are  not  joint  in  production  and  exhibit  substitutability  in  the 
preferences  of  consumers.   If  the  theory  of  utility  maximization  holds,  one 
might  expect  that  the  sum  of  willingness  to  pay  for  an  improvement  in  A  and  an 
improvement  in  B  would  exceed  the  willingness  to  pay  for  a  simultaneous 
improvement  in  both  A  and  B.   Such  a  test  could  be  carried  out  using  three 
independent  samples.   The  first  could  value  an  improvement  in  A;  the  second, 
an  improvement  in  B;  and  the  third,  a  simultaneous  improvement  in  A  and  B. 
Another  type  of  construct  validity  test  could  be  based  on  the  concept  of
transitivity.   If  intervention  X  is  valued  more  highly  than  intervention  Y  by
one  sample,  and  intervention  Y  is  more  highly  valued  than  intervention  Z  by
another,  it  might  be  hypothesized  that  a  third  sample  would  place  a  higher
value  on  X  than  on  Z.   Such  advanced  tests  of  construct  validity  are  becoming
increasingly  important  to  the  evaluation  of  contingent  valuation  studies.

In  the  paper  included  as  Appendix  A  of  this  report,  we  suggest  that
studies  be  classified  into  a  three-level  hierarchy  expressing  increasing
levels  of  construct  validity.   Quoting  from  that  paper:

At  the  lowest  level  would  be  studies  that  either  have  not  included  any 
construct  validity  tests  or  have  failed  to  pass  rudimentary  tests.   Such 
studies  might  typically  have  had  low  budgets  and/or  severe  time 
constraints  and  this  may  have  limited  the  amount  of  qualitative  research 
that  could  be  conducted,  thus  limiting  the  content  validity  of  the  study 
as  well.   Such  studies  may  still  be  useful  for  scientific  purposes  or  as 
exercises  involving  training  of  students,  but  should  be  used  in  policy 
analysis  and  litigation  only  with  the  heaviest  caveats.   The  second 
level  of  the  hierarchy  would  involve  studies  that  have  achieved  a  fair 
amount  of  success  in  the  rudimentary  tests,  but  that  either  do  not  have 
the  budget  to  support  advanced  testing  or  have  not  succeeded  in  passing 
advanced  tests.   Second-level  studies  may  be  usable  in  cost-benefit 
analyses,  since  normally  such  analyses  are  simply  interested  in 
determining  whether  the  benefits  of  an  intervention  exceed  the  costs. 
Of  course,  suitable  caveats  would  need  to  be  introduced  into  such 
studies.   Unless  benefits  exceed  costs  by  a  fairly  wide  margin  or  vice 
versa,  potential  biases  in  second  level  studies  may  mean  that  the  issue 
of  whether  benefits  exceed  costs  remains  open.   Second  level  studies  may 
be  less  useful  for  litigation,  where  relatively  precise  estimates  of 
value  are  needed  to  assess  damages,  but  they  may  still  be  useful  in 
preliminary  damage  assessments  and  for  such  purposes  as  "grossly 
disproportionate  tests."  Third  level  studies  are  studies  that  have 
conducted  and  achieved  substantial  success  in  sophisticated  rudimentary 
tests  and/or  have  conducted  and  passed  advanced  tests.   Provided  that 
such  studies  are  judged  to  have  a  high  degree  of  content  validity  as 
well,  they  would  have  the  highest  level  of  credibility  for  benefit-cost 
analysis  and  litigation. 

The  Clark  Fork  study  report  presents  valuation  equations  estimated  using
regression.   Independent  variables  included  a  scale  item  on  the  respondents'
self-reported  importance  of  cleaning  up  the  Clark  Fork  NPL  sites;  the  sum  of
scale  items  on  the  importance  of  issues  associated  with  groundwater,  surface
water,  and  terrestrial  contamination  in  Montana;  a  measure  of  respondents'
views  about  the  likelihood  of  using  the  resources;  a  measure  of  respondents'
rankings  in  terms  of  the  importance  of  various  reasons  for  cleanup;  and  the
self-reported  degree  to  which  respondents  felt  they  should  be  responsible  for
paying  for  a  cleanup.   The  regressions  also  included  a  series  of  variables
reflecting  respondent  characteristics.   These  variables  included  income,  age,
gender,  participation  in  recycling  activities,  proximity  to  the  affected
sites,  and  whether  respondents  had  been  members  of  or  had  contributed  to  an
environmental  organization  in  the  year  prior  to  receiving  the  survey. 

The  results  of  these  regressions  demonstrated  that  willingness  to  pay 
for  cleanup  was  frequently  related  to  these  variables  in  ways  that  are 
consistent  with  economic  theory.   For  example,  the  regressions  consistently 
identified  income  as  a  significant  positive  predictor  of  willingness  to  pay. 
In  addition,  respondents  living  closer  to  the  site  expressed  significantly 
higher  values.   Finally,  the  variables  reflecting  the  importance  of  various 
aspects  of  the  intervention  were  often  significant.   Individuals  expressing 
higher  levels  of  concern  for  cleaning  up  hazardous  waste  sites  and/or  higher 
levels  of  concern  for  cleaning  up  the  Clark  Fork  NPL  sites  tended  to  express 
high  willingness-to-pay  values.   These  rudimentary  tests  provide  solid  support 
for  the  construct  validity  of  the  study. 
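
A rudimentary test of this kind can be sketched in a few lines. The example below is a simplified illustration only: it regresses stated willingness to pay on a single predictor (income), whereas the valuation equations in the study included many variables. The function name `simple_ols` and all of the data are hypothetical and are not taken from the Clark Fork survey.

```python
import math

def simple_ols(x, y):
    """Return slope, intercept, and the slope's t statistic for y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual variance with n - 2 degrees of freedom.
    sse = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    se_slope = math.sqrt(sse / (n - 2) / sxx)
    return slope, intercept, slope / se_slope

# Hypothetical data, NOT from the Clark Fork survey:
# household income (thousands of dollars) and stated WTP (dollars).
income = [15, 22, 28, 35, 41, 48, 55, 63, 70, 85]
wtp = [5, 10, 5, 20, 15, 30, 25, 40, 30, 60]

slope, intercept, t_stat = simple_ols(income, wtp)
print(f"slope = {slope:.3f}, t = {t_stat:.2f}")
```

A positive slope with a large t statistic is the pattern theory predicts for income. A full analysis would hold the other explanatory variables constant rather than examine income alone.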

Moving  to  advanced  tests,  one  can  see  from  the  survey  results  that  many 
respondents  prefer  complete  cleanup  to  partial  cleanup  of  the  Clark  Fork 
sites.   Thus,  a  scope  test  here  would  involve  testing  the  hypothesis  that 
complete  cleanup  has  a  higher  reported  value.   The  most  rigorous  form  of  such 
a  scope  test  would  involve  a  comparison  of  mean  stated  willingness  to  pay  from 
two  independent  samples.   In  the  Clark  Fork  study,  the  most  rigorous  scope 
test  would  involve  a  comparison  between  mean  willingness  to  pay  for  complete 
cleanup  from  Version  1  of  the  survey  instrument  and  mean  willingness  to  pay 
for  partial  cleanup  from  Version  2  (as  reported  in  Table  5-4).   The  Clark  Fork
study  clearly  does  not  pass  this  most  rigorous  scope  test,  obtaining  nearly 
identical  mean  values  for  complete  and  partial  cleanup  before  and  after 
adjustments  for  protest  zeroes,  outliers,  embedding,  sampling  issues,  and 
feelings  of  responsibility. 
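
A between-sample scope test of this kind amounts to a one-tailed comparison of two independent sample means. The following is a minimal sketch, assuming a large-sample normal approximation to the unequal-variance two-sample statistic. The function name `scope_test` and the willingness-to-pay figures are hypothetical illustrations, not data or procedures from the Clark Fork study.

```python
import math
from statistics import NormalDist, mean, variance

def scope_test(wtp_complete, wtp_partial):
    """One-tailed test of H1: mean WTP(complete) > mean WTP(partial).

    Uses a normal approximation to the two-sample (unequal-variance)
    statistic and returns (difference in means, z, one-tailed p-value).
    """
    diff = mean(wtp_complete) - mean(wtp_partial)
    se = math.sqrt(variance(wtp_complete) / len(wtp_complete)
                   + variance(wtp_partial) / len(wtp_partial))
    z = diff / se
    p_value = 1 - NormalDist().cdf(z)
    return diff, z, p_value

# Hypothetical payment-card responses (dollars) from two survey versions.
# Real samples would be far larger; these are for illustration only.
complete = [0.0, 5.0, 10.0, 10.0, 20.0, 20.0, 30.0, 50.0, 50.0, 100.0]
partial = [0.0, 5.0, 5.0, 10.0, 20.0, 20.0, 30.0, 40.0, 50.0, 100.0]

diff, z, p_value = scope_test(complete, partial)
print(f"difference = {diff:.2f}, z = {z:.3f}, one-tailed p = {p_value:.3f}")
```

With nearly identical means in the two samples, as in this made-up example, the p-value stays far from conventional significance levels, which is what failing the test looks like in practice.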

This  failure  to  pass  the  most  rigorous  scope  test  does  count  against  the 
construct  validity  of  the  study,  but  should  not  be  taken  as  damning  evidence. 
The  between-sample  scope  result  could  be  a  consequence  of  several  factors. 
For  instance,  the  test  is  predicated  on  the  idea  that  there  is,  in  reality,  a 
difference  in  willingness  to  pay  for  the  two  levels  of  cleanup.   Regardless  of 
the  researchers'  perception  of  differences  between  the  two  levels  of 
environmental  intervention,  if  nearly  all  respondents  do  not  attach  different
values  across  the  two  levels  of  cleanup,  a  scope  test  could  not  be  passed.   In
the  same  vein,  if  the  difference  between  the  willingness  to  pay  for  the  two 
interventions  is  small,  then  detecting  this  difference  might  require  very 
large  samples,  a  potential  problem  that  would  be  exacerbated  by  the  rather 
large  gaps  between  values  in  the  payment  card. 
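
The sample-size point can be made concrete with a standard normal-approximation formula for comparing two independent means. The sketch below is illustrative only: the standard deviation, the candidate differences, the 5 percent one-tailed significance level, and the 80 percent power target are all hypothetical assumptions, not figures from the Clark Fork study.

```python
import math
from statistics import NormalDist

def n_per_sample(diff, sd, alpha=0.05, power=0.80):
    """Approximate number of respondents needed in EACH of two independent
    samples to detect a true difference in means of `diff` with a one-tailed
    z-test, assuming a common standard deviation `sd`."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)  # one-tailed critical value
    z_power = NormalDist().inv_cdf(power)      # quantile for the power target
    return math.ceil(2 * ((z_alpha + z_power) * sd / diff) ** 2)

# Hypothetical inputs: a $100 standard deviation in stated WTP.
for diff in (50, 20, 5):
    print(f"true difference ${diff}: about {n_per_sample(diff, 100)} per sample")
```

Because the required sample grows with the inverse square of the difference, halving the true difference roughly quadruples the number of respondents needed. Coarse payment-card intervals would add further noise that this sketch does not model.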

The  researchers  do  further  analysis  on  the  between-sample  scope  test
using  subsets  of  their  samples  defined  by  responses  to  the  embedding
question,  which  was  stated  as  follows:


Q30    Some people tell us it is difficult to think about paying to clean
       up just one site or even just one environmental problem.  Would
       you say the dollar amount in Q28 you stated your household would
       be willing to pay is:  (Circle number of best answer)

       1  JUST FOR CLEANUP AT THE CLARK FORK RIVER BASIN.  GO TO Q32.

       2  PARTLY FOR CLEANUP AT THE CLARK FORK RIVER BASIN AND
          PARTLY TO CLEAN UP OTHER HAZARDOUS WASTE SITES.

       3  BASICALLY A CONTRIBUTION FOR ALL ENVIRONMENTAL OR
          OTHER CAUSES.

          OTHER (PLEASE SPECIFY)


Those  who  answered  this  question  with  "Other"  were  excluded  from  the 
data  set  and  the  between-sample  scope  hypothesis  was  again  tested.   This 
procedure  is  appealing  because  those  who  answered  "Other"  may  be  those  who  had 
the  most  difficulty  with  the  contingent  valuation  questions  to  begin  with. 
The  subsample  of  remaining  respondents  shows  a  difference  in  values  for 
complete  and  partial  cleanup.   The  difference  is  not  very  large,  but  is 
statistically  significant  at  the  11  percent  level  in  a  one-tailed  test 
(p.  5-41).   Stated  differently,  the  researchers  can  be  89  percent  confident
that  the  scope  test  is  passed  for  this  subsample  of  respondents.   This 
supports  the  construct  validity  of  the  study,  although  not  as  much  as  passing 
the  scope  test  with  the  full  set  of  observations. 
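
For readers who want to see the mechanics of a between-sample scope test, the following is a minimal sketch of a one-tailed comparison of two independent sample means. It uses a large-sample normal approximation rather than whatever procedure the Clark Fork researchers actually applied, and the response values are hypothetical and far fewer than any real survey would collect.

```python
import math
from statistics import NormalDist, mean, variance

def scope_test_between(complete, partial):
    """One-tailed large-sample z-test of H0: mean(complete) <= mean(partial)
    against H1: mean(complete) > mean(partial), for two independent samples.
    Returns (difference in means, z statistic, one-tailed p-value)."""
    diff = mean(complete) - mean(partial)
    se = math.sqrt(variance(complete) / len(complete)
                   + variance(partial) / len(partial))
    z = diff / se
    p_one_tailed = 1 - NormalDist().cdf(z)
    return diff, z, p_one_tailed

# Hypothetical payment-card responses (dollars per household).
version1_complete = [0, 5, 10, 10, 25, 25, 50, 50, 100, 250]
version2_partial = [0, 0, 5, 10, 10, 25, 25, 50, 50, 100]

diff, z, p = scope_test_between(version1_complete, version2_partial)
print(f"difference = {diff:.2f}, z = {z:.2f}, one-tailed p = {p:.3f}")
```

A one-tailed p-value of 0.11 from such a test means that, if complete cleanup were truly valued no more highly than partial cleanup, a difference at least as large as the one observed would be expected in about 11 percent of samples.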

Another,  albeit  weaker,  scope  test  involves  a  comparison  of  average 
values  for  complete  cleanup  to  average  values  for  partial  cleanup  for  Versions 
1  and  2  separately.   This  "within-sample"  scope  test  is  not  as  compelling  as 
the  between-sample  test  because,  having  read  about  both  levels  of  cleanup,
respondents  may  anticipate  that  the  researchers  are  expecting  different  values
commensurate  with  the  different  levels  of  cleanup.   However,  within-sample
scope  tests  do  provide  information  on  construct  validity  despite  this  possible 
shortcoming.   The  Clark  Fork  study  would  pass  this  test. 
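
The within-sample design differs from the between-sample design in that each respondent supplies both values, so the natural test works on paired differences. The sketch below assumes hypothetical responses and again uses a normal approximation; it is meant only to show how the two designs differ, not to reproduce the study's calculations.

```python
import math
from statistics import NormalDist, mean, stdev

def scope_test_within(complete, partial):
    """Paired one-tailed z-test on within-respondent differences.
    complete[i] and partial[i] are the two values stated by respondent i.
    Returns (mean difference, one-tailed p-value)."""
    diffs = [c - p for c, p in zip(complete, partial)]
    mean_diff = mean(diffs)
    se = stdev(diffs) / math.sqrt(len(diffs))  # standard error of mean diff
    z = mean_diff / se
    return mean_diff, 1 - NormalDist().cdf(z)

# Hypothetical respondents, each valuing both levels of cleanup.
complete = [10, 25, 25, 50, 100, 0, 50, 250]
partial = [5, 10, 25, 25, 50, 0, 50, 100]

mean_diff, p = scope_test_within(complete, partial)
print(f"mean within-respondent difference = {mean_diff:.2f}, p = {p:.3f}")
```

Pairing removes variation between respondents from the comparison, which is one reason a within-sample test can detect a difference that a between-sample test with the same number of respondents cannot.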

Given  the  study's  considerable  success  in  passing  rudimentary  construct 
validity  tests  and  its  mixed,  but  positive,  performance  in  passing  scope 
tests,  I  would  designate  it  a  Level  3  study,  but  at  the  low  end  of  Level  3. 
Clearly,  a  study  that  passes  a  cross-sample  scope  test  with  flying  colors 
would  have  to  be  considered  stronger  in  construct  validity.   However,  to 
ignore  the  within-sample  scope  test  results  would  not  be  justified  either, 
since  they  show  that,  given  the  two  levels  of  cleanup  in  juxtaposition,  many 
respondents  recognized  the  differences  and  translated  the  differences  into  the 
values  they  expressed.   That  they  were  influenced  by  their  perceptions  of 
researchers'  expectations  is  possible,  but  there  is  no  evidence  on  this  topic
one  way  or  the  other.   It  is  somewhat  encouraging  that  the  residual  values  are 
so  close  between  Versions  1  and  2.   True,  this  could  be  coincidence.   It  could 
also  be  true  that,  as  the  report  suggested  (p.  5-11),  "This  [i.e.,  the 
similarity  in  residual  willingness  to  pay  across  the  two  versions]  may  reflect 
that  respondents  have  somewhat  more  measurement  error  in  selecting  a  WTP 
amount  for  complete  or  partial  cleanup  than  they  have  in  determining  the 
difference  in  value  they  assign  when  comparing  two  scenarios."   Though  weaker 
as  evidence  of  construct  validity  than  cross-sample  scope  tests,  within-sample 
scope  tests  deserve  to  be  treated  as  advanced  tests  in  deciding  whether  to 
assign  a  study  to  Level  2  or  Level  3.


COMPARISON  OF  CLARK  FORK  STUDY  WITH  TWO  OTHER  STUDIES 

The  content  and  construct  validity  assessment  of  the  Clark  Fork  study 
presented  in  the  previous  two  sections  is  intended  to  stand  on  its  own  up  to  a 
point.   However,  further  understanding  of  my  evaluation  of  the  Clark  Fork 
study  can  be  added  by  comparing  its  evaluation  to  my  evaluations  of  two  other 
studies.   For  these  comparisons,  I  chose  studies  that  were  related  directly  or 
indirectly  to  the  damage  assessment  of  the  Exxon  Valdez  oil  spill  and  hence 
are  potentially  relevant  in  the  context  of  the  natural  resource  damage
assessment.   One  is  the  passive,  or  non-use,  value  study  done  for  the  state  of
Alaska  and  reported  in  Carson  et  al.  (1992).   Many,  including  the  NOAA  Panel,
view  this  study  to  be  of  high  quality. 

Second,  I  will  examine  a  contingent  valuation  study  performed  for  Exxon. 
This  study  resulted  in  a  data  set  that  formed  the  basis  for  a  series  of  papers 
presented  at  a  conference  in  1992  (Cambridge  Economics  1992),  in  a  subsequent
proceedings  volume  from  that  conference  (Hausman  1993),  and  most  recently  in
an  article  published  in  the  American  Journal  of  Agricultural  Economics
(McFadden  1994).   This  study  did  not  address  the  damages  from  the  Valdez  spill
directly,  but  rather  estimated  the  values  associated  with  avoiding  logging  in 
some  wilderness  areas  in  the  West.   The  purpose  of  this  study  was  to  collect 
data  that  could  be  used  to  assess  the  overall  validity  of  the  contingent 
valuation  method.   The  survey  involved  a  number  of  treatments  that  differed  in 
terms  of  the  format  of  the  contingent  valuation  question  and  the  number  of 
wilderness  areas  to  be  logged.   Depending  on  the  treatment,  logging  would  have 
occurred  in  from  1  to  57  wilderness  areas  in  National  Forests  in  Colorado, 
Wyoming,  Montana,  and  Idaho. 

I  will  begin  by  comparing  content  validity  scores  for  the  three  studies, 
using  WA  to  symbolize  the  wilderness  area  study,  CF  to  symbolize  the  Clark 
Fork  study,  and  EV  to  symbolize  the  damage  assessment  for  the  Exxon  Valdez 
spill.   Each  of  the  questions  considered  for  CF  will  now  be  applied  to  WA  and 
EV.   Results  are  summarized  in  Table  2. 

In  my  judgement,  neither  WA  nor  EV  had  fatal  flaws.   Some  might  debate 
this  conclusion  with  regard  to  WA  (see,  for  example,  Carson  and  Flores  1992).
I  did  assign  WA  relatively  low  scores  on  items  3,  4,  and  12.   Nevertheless,  I
have  not  called  the  flaws  noted  there  fatal.   It  seems  to  me  that  the 
procedures  used  in  WA  are  at  least  as  good  as  many  studies  that  have  appeared 
in  the  published  literature.   The  data  were  considered  strong  enough  by 
several  prominent  economists  to  warrant  their  attention  for  analysis.   These 
data  also  were  considered  strong  enough  by  some  to  warrant  a  major  paper  in 
one  of  the  best  economics  journals  that  regularly  publishes  contingent  valuation
research.   Given  WA's  goals,  which  were  purely  methodological  and  did  not
involve  damage  evaluation  or  policy  analysis,  and  given  the  very  interesting
analysis  of  the  data  in  McFadden's  paper,  I  am  unwilling  to  throw  it  out. 

None  of  the  three  studies  suffered  from  a  great  deal  of  theoretical 
ambiguity  about  the  nature  of  the  value  (or  values)  they  were  trying  to 
measure.   For  purposes  of  a  rating  on  Question  2,  all  three  studies  gave 
considerable  explicit  attention  to  the  theoretical  definition  of  the  true 
value  they  were  seeking  to  estimate.   CF  was  rated  slightly  lower  than  EV  and 
WA  in  this  regard  because  of  my  theoretical  misgivings  about  its  partitioning 
of  resource  values  and  because  of  its  failure  to  consider  the  theoretical 
relationships  between  the  true  value  of  the  damages  and  the  concept  of 
residual  value  that  was  used  as  a  proxy,  as  discussed  above. 

More  serious  concerns  arise  about  WA  under  Questions  3  and  4.   It  was
apparently  assumed  that  the  only  attributes  relevant  to  potential  respondents 
were  the  size  of  the  areas  to  be  logged,  the  percent  of  area  logged  each  year, 
and  the  fact  that  logging  would  involve  road  building  and  use  of  heavy 
equipment.   (Other  wilderness  areas  to  be  logged  or  not  logged  are  treated  as 
substitutes  here,  not  attributes.)   In  fact,  logging  in  wilderness  areas  would 
affect  a  host  of  environmental  attributes  of  potential  relevance  to  study 
subjects.   For  example,  nothing  was  specified  about  logging  practices.   Clear 
cutting  is  considered  less  desirable  than  selective  cutting  by  some  members  of 
the  public,  yet  the  issue  was  apparently  not  raised.   Depending  on  logging 
practices,  water  quality  in  streams  and  downstream  reservoirs  would  also  be 
more  or  less  affected.   Changes  in  water  quality  would  in  turn  influence 
populations  of  fish  and  other  aquatic  organisms.   In  wilderness  areas  west  of 
the  Continental  Divide,  threatened  and  endangered  species  of  salmon  could  be 
affected.   Other  wildlife  populations  would  also  be  influenced  and  not  all  of 
the  effects  would  be  negative.   Populations  of  large  ungulates,  for  example, 
might  benefit  from  logging.   Potential  effects  of  the  logging  on  scenic  vistas 
were  not  explored.   Contingent  valuation  studies  conducted  in  the  context  of 
the  spotted  owl  controversy  showed  that  people  are  very  concerned  about  the 
fate  of  old  growth  forests,  yet  the  extent  to  which  the  proposed  logging  would 
involve  old  growth  was  not  mentioned.   There  appears  to  be  no  recognition  that 
opening  up  wilderness  areas  to  logging  might  be  viewed  as  a  positive  step  by 
some  respondents  since  it  would  augment  employment  and  regional  incomes.   So
far  as  I  can  tell,  the  research  involved  no  effort  to  document  what  these 
effects  would  be,  to  investigate  which  effects  potential  respondents  would 
find  relevant,  to  evaluate  the  current  state  of  knowledge  of  potential 
respondents  regarding  the  possible  effects  of  logging,  or  to  develop  effective 
ways  to  communicate  needed  information  to  them.   Furthermore,  the  pretests 
used  in  the  development  of  the  WA  instruments  are  not  as  well  suited  to 
identifying  important  attributes,  and  ways  to  communicate  changes  in  their 
status,  as  verbal  protocols  and/or  focus  groups.   As  nearly  as  I  have  been 
able  to  tell  based  on  Diamond  et  al.  (1992,  p.  22),  the  pretest  consisted  of
administering  by  phone  earlier  versions  of  a  mostly  developed  instrument  with 
follow-up  questions  designed  to  help  fine-tune  the  instrument.   Such  pretests 
are  somewhat  useful  in  improving  communication  but  provide  very  limited 
information  about  more  fundamental  aspects  of  the  scenario.   Hence,  WA  is 
rated  quite  low  under  Questions  3  and  4.   These  low  ratings  reflect  concerns 
that  respondents  were  not  well  informed  when  they  answered  the  contingent 
valuation  questions.   The  rating  of  the  WA  study  on  Questions  3  and  4  would  be 
improved  if  focus  groups  and/or  verbal  protocols  showed  that  the  effects  of 
logging  on  all  or  most  respondent-relevant  attributes  were  covered  in  the
scenario.   I  am  not  aware  of  any  focus  groups  or  verbal  protocols  conducted 
for  WA. 

CF  and  EV  did  much  better  in  addressing  these  issues.   Both  were  linked 
to  large-scale  efforts  on  the  part  of  trustees  to  verify  injuries  to  resources 
from  the  releases  at  issue  and  drew  upon  those  efforts  for  information  about 
the  effects  of  the  interventions  being  evaluated.   Both  involved  extensive 
amounts  of  qualitative  research  to  identify  respondent-relevant  attributes  and
learn  how  to  communicate  effects  well.   EV,  in  particular,  worked  intensively 
to  address  the  issues  under  Questions  3  and  4  and  received  full  points  on 
these  two  questions.   Its  efforts  in  qualitative  research  were  extensive.   It 
also  used  several  questions  in  various  versions  of  the  pretest  and  pilot 
survey  instruments  to  ensure  that  respondents  to  the  final  survey  were  well 
informed  when  they  came  to  the  valuation  question.   CF  received  fewer  points 
on  Question  4  than  EV  because  of  my  concerns  about  the  effectiveness  of 
communications  in  that  study  and  the  lack  of  information  about  the  effects  of
resource  injuries  as  discussed  above. 

WA  and  CF  are  roughly  similar  in  the  amount  of  information  they  present 
about  environmental  substitutes  (Question  5).   Just  as  the  CF  instrument
stressed  the  large  number  of  other  sites  in  Montana  where  toxic  contamination 
may  have  occurred,  so  the  WA  instrument  stressed  the  total  number  and  acreage 
of  wilderness  areas  in  the  four-state  area  and  the  number  to  be  logged.   I 
gave  CF  a  slightly  higher  rating  than  WA.   CF  provided  many  details  about 
substitute  NPL  sites.   In  comparison,  WA  was  very  skimpy  on  details  about  the 
other  wilderness  areas.   Except  for  saying  that  at  least  one  would  be  located 
in  the  respondent's  state,  neither  the  location  of  other  wilderness  areas  (7,
8,  or  9  depending  on  the  treatment)  to  be  logged  nor  the  total  number  of
acres  involved  was  mentioned.   (The  treatment  where  all  57  wilderness  areas
were  to  be  logged  was  an  exception.)   EV  is  superior  to  both  CF  and  WA  in 
dealing  with  environmental  substitutes.   For  example,  it  presented  maps  that 
clearly  showed  that  only  a  small  percentage  of  the  Alaska  coastline  was 
affected  by  the  spill.   Controlling  air  pollution  and  protection  of  wilderness 
areas  were  explicitly  mentioned  in  early  parts  of  the  EV  survey  as  alternative 
uses  of  public  funds.   Just  before  the  contingent  valuation  question, 
respondents  were  told  that  some  respondents  voted  no  in  the  referendum  because 
they  felt  that  the  money  could  be  better  spent  elsewhere,  presumably  including 
other  environmentally  oriented  interventions. 

None  of  these  studies  explicitly  reminded  respondents  about  their  budget 
constraints.   However,  all  three  were  sufficiently  stark  in  their  emphasis  on 
commitments  of  money  to  be  satisfactory  in  this  regard  from  my  perspective. 
All  three  allowed  respondents  to  reconsider  their  responses  to  the  valuation 
question  (CF  through  its  embedding  question),  further  emphasizing  the  need  to
consider  carefully  the  possible  payment  of  money  to  achieve  the  intervention. 
CF  asked  respondents  to  consider  alternative  ways  to  pay  for  the  intervention, 
which  also  served  to  emphasize  that  commitments  expressed  in  the  contingent 
valuation  question  would,  if  actually  paid,  deplete  the  respondent's  budget. 
EV  received  an  extra  point  in  my  ranking  under  Question  6  because,  just  prior 
to  the  first  contingent  valuation  question,  it  noted  that  some  respondents
would  vote  against  the  referendum  because  they  could  not  afford  the  cost. 

Turning  to  Question  7,  EV  had  a  carefully  designed  and  very  detailed 
context  for  valuation.   Furthermore,  the  referendum  format  used  by  EV  is 
widely  considered  to  have  theoretically  satisfactory  incentive  properties.   On 
the  other  hand,  both  EV  and  WA  used  a  tax  vehicle,  which  may  have  introduced  a 
downward  bias  in  valuation  responses  because  of  the  unpopularity  of  taxes 
generally.   Letting  respondents  choose  their  vehicles,  as  was  done  in  CF, 
appears  to  be  more  neutral  in  this  regard.   On  the  other  hand,  the  incentive 
properties  of  the  contingent  valuation  questions  in  WA  and  CF  are  not  clear. 
Considering  all  these  aspects,  I  assigned  WA  3  points  and  EV  4  points  for
context.   This  compares  to  CF's  3  points  on  this  item. 

All  three  studies  clearly  elicited  willingness  to  pay  and  received  full 
marks  on  Question  8. 

The  acceptability  and  believability  of  the  scenarios  (Questions  9  and  10 
respectively)  can  be  considered  together.   Recall  the  difference.   Quoting 
Appendix  B,  "A  study  subject  accepts  the  scenario  when  he  or  she  implicitly 
agrees  to  proceed  with  the  valuation  exercise  based  on  the  information  and 
context  provided."   A  scenario  is  "believed"  to  the  degree  that  respondents 
expect  their  responses  to  the  contingent  valuation  question  to  actually  affect 
how  much  of  the  environmental  amenities  in  question  they  will  receive  and  how 
much  they  will  actually  pay  if  the  intervention  is  adopted.   A  scenario  is 
acceptable,  yet  not  believed,  when  respondents  agree  to  engage  in  "what  if"
exercises  where  they  know  that  the  scenario  is  completely  hypothetical.

Once  again  EV  ranks  higher  than  the  other  two  studies.   One  of  the  goals 
of  the  extensive  efforts  that  went  into  the  design  of  the  EV  survey  instrument 
was  to  develop  a  final  scenario  that  would  be  both  acceptable  to  and 
believable  by  respondents.   Debriefing  questions  verified  that  this  goal  was 
largely  achieved.   EV  did  not  get  full  scores  on  Questions  9  and  10  because 
the  tax  vehicle  may  have  impeded  acceptance  of,  and  belief  in,  the  scenario. 
That  a  federal  income  tax  would  be  levied  for  one  year  to  establish  an  escort 
ship  program  for  oil  tankers  in  Prince  William  Sound  may  have  seemed  a  bit 
implausible  to  some  respondents  given  the  lack  of  precedent  for  such  a  tax. 
Also,  the  valuation  questions  explicitly  stated  that  the  intervention  would
cost  the  household  in  question  a  specified  amount  in  taxes.   This  was  bound  to
raise  questions  in  the  minds  of  some  respondents  about  how  the  interviewer
could  know  the  exact  implications  of  the  intervention  for  that  household's
taxes.

WA  raised  serious  concerns  with  regard  to  acceptability  and
believability.   Logging  of  even  one  wilderness  area  would  be  a  major
environmental  battle.   A  proposal  to  log  all  57  wilderness  areas  in  the  four 
states  would  have  generated  a  tremendous  furor  that  would  have  occupied  the 
national  and  regional  press  endlessly.   McFadden  (1994,  p.  696)  pointed  out, 
"This  resource  issue  was  chosen  because  at  the  time  of  the  study  in  1990  there 
was  active  discussion  in  Congress  and  in  the  media  in  the  western  U.S. 
regarding  logging  on  government  lands,  making  logging  familiar  to  many 
respondents."   However,  logging  on  federal  lands  is  one  thing,  logging  in 
designated  wilderness  areas  is  quite  another.   There  are  two  aspects  of  WA 
that  are  likely  to  seem  especially  implausible  to  respondents:  that  a
specified  number  of  wilderness  areas  (7,  8,  or  9  depending  on  the  survey 
version)  had  already  been  slated  for  logging  without  their  realizing  it  and 
that  all  57  wilderness  areas  in  four  states  were  being  considered  for  logging. 
Not  much  is  known  about  how  members  of  the  public  are  likely  to  respond  when 
confronted  over  the  phone  with  a  major  policy  change  that  they  find 
implausible  and  about  which  they  have  not  previously  heard  or  read.   Designers 
of  the  survey  apparently  were  aware  that  respondents  might  have  a  negative 
reaction  to  the  logging  proposal.   Interviewers  were  instructed  to  tell 
respondents  expressing  concern  about  the  issue  that  logging  in  wilderness
areas  "...is  not  currently  being  considered  by  lawmakers."   Outrage,  suspicion 
about  the  motives  for  and  true  goals  of  the  survey,  refusal  to  participate  in 
the  survey  or,  if  they  proceed,  to  take  the  survey  seriously,  and  curiosity 
about  details  (in  this  case  not  provided)  are  all  plausible  reactions.   Such 
reactions  create  doubts  about  the  quality  of  resulting  data.   Diamond  et  al.
(1992,  p.  22)  reported,  "Overall,  the  response  rate  was  62%."   Unfortunately,
they  do  not  elaborate  on  how  this  was  calculated  or  how  the  non-response  was 
broken  down.   If  this  number  expresses  interview  completions  as  a  percentage 
[...]

[...] that can be presented effectively in a telephone [...] cannot be
supported by visual aids such as maps, diagrams, and photographs. Also,
constraints on time to complete the survey may be more severe for telephone
surveys than for either mail or in-person surveys. Even if the researchers
conducting WA had done more to document which attributes of wilderness areas
were important to potential respondents and had thoroughly documented the
effects of logging on those attributes, I believe the choice of a telephone
survey mode would have prevented them from adequately communicating what they
had learned. My pessimism about the potential for telephone CV surveys led me
to assign WA only 2 points for survey mode.

Regarding Question 13, WA involved three formal pretests with fairly large
samples and including debriefing questions to attempt to investigate whether
respondents understood the scenario. I assigned it 1 point less here than CF
received because the latter included not only pretests but also verbal
protocols. EV received a full 5 points because of the potentially more
effective, very extensive procedures it followed, which involved
application [...]

Turning to Question 14, all three studies followed sound survey execution
procedures and received high scores. EV received the full 5 points. Sampling
was accomplished using a multi-stage area probability sample. This sample was
constructed by first randomly choosing 31 counties or [illegible], then
randomly selecting [illegible] blocks within counties, and finally selecting
[illegible] residences from the selected blocks. This type of sampling
provides perhaps the best coverage of the sampling frame. Standard, high
quality procedures were followed in gathering data. EV reported a response
rate of [illegible] percent. This was calculated as completions divided by the
number of potential interviews less non-English speaking households and vacant
residences. The WA sample was obtained using random digit dialing. The
coverage of this type of sample is quite good, the major deficiency being the
loss of households without telephones. WA reported that up to 10 follow-up
phone calls were made to attempt to gain a response, which seems quite
laudable. Depending upon how it was calculated, the 62 percent reported
response rate in WA may raise [...] assigned 4 points, reflecting its apparent
low response rate. For reasons already noted, CF received 4 points on this
item.
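
Why the method of calculation matters for a reported response rate can be shown with a short sketch. The function below follows the definition described for EV, completions divided by potential interviews less non-English speaking households and vacant residences. The function name and all counts are hypothetical and were chosen only to show how the choice of denominator moves the rate.

```python
def response_rate(completions, sampled_units, non_english=0, vacant=0):
    """Completions as a share of eligible units, where eligible units are
    sampled units minus non-English-speaking households and vacancies."""
    eligible = sampled_units - non_english - vacant
    if eligible <= 0:
        raise ValueError("no eligible units remain")
    return completions / eligible

# Hypothetical counts: the same 620 completions under two denominators.
print(f"{response_rate(620, 1000):.1%}")                              # 62.0%
print(f"{response_rate(620, 1000, non_english=40, vacant=110):.1%}")  # 72.9%
```

The same fieldwork outcome can therefore be reported as a noticeably higher or lower rate, which is why an unexplained figure is hard to evaluate.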

Regarding  econometric  procedures  (Question  15),  WA,  and  particularly  the
analysis  of  McFadden  (1994),  is  outstanding.[2]   The  design  of  the  analysis  is
driven  by  economic  theory,  is  very  thorough,  and  explores  a  number  of
functional  forms  as  well  as  a  non-parametric  procedure.   Thus,  I  assigned  WA
5  points  for  econometrics.   CF  and  EV  are  quite  adequate  in  this  regard  as
well  but  perhaps  less  innovative  and  were  assigned  ratings  of  4.

Turning  to  Question  16,  the  written  materials  from  EV  were  very  thorough 
and  explicit  regarding  the  procedures  followed  and  the  results  of  the 
analysis,  warranting  full  points.   WA  did  not  have  a  complete  report  for  easy 
reference,  but  what  was  available  could  be  pieced  together  fairly  well  from 
the  various  papers  that  have  been  written.   I  assigned  it  a  4  for  reporting, 
the  same  as  CF  received  on  this  item. 

Reasons  for  assigning  CF  only  8  out  of  the  possible  10  points  under 
Question  17  focused  on  how  embedding  was  handled.   No  other  issues  arose  in  my 
reviews  of  WA  and  EV  and  I  assigned  them  the  full  10  points  there. 

The  level  of  content  validity  of  the  three  studies  can  be  compared  by 
examining  Table  2.   EV  did  quite  well,  earning  95  points.   WA  was  sufficiently 
weak  on  enough  items  to  earn  only  57.   The  Clark  Fork  study  was  in  between  at 
73  points. 
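
The totals can be re-derived from the item scores in Table 2. The short sketch below simply sums the scores for Questions 2 through 17 as I read them from the table, taking WA's score on Question 14 to be the 4 points stated in the text. It is an arithmetic check, not part of the rating method.

```python
# Item scores transcribed from Table 2 for Questions 2 through 17.
# Question 1 (fatal flaws) is a yes/no screen and carries no points.
scores = {
    "WA": [3, 2, 2, 3, 4, 3, 5, 2, 2, 3, 2, 3, 4, 5, 4, 10],
    "CF": [3, 7, 6, 4, 4, 3, 5, 4, 3, 4, 6, 4, 4, 4, 4, 8],
    "EV": [4, 10, 10, 5, 5, 4, 5, 4, 4, 5, 10, 5, 5, 4, 5, 10],
}

totals = {study: sum(items) for study, items in scores.items()}
print(totals)  # {'WA': 57, 'CF': 73, 'EV': 95}
```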


[2]  I  should  note  that  some  of  the  statistical  procedures  applied  by  Diamond
et  al.  (1992)  have  been  vigorously  criticized  in  a  paper  by  Carson  and  Flores
(1992).
Table 2:  Comparative Content Validity Ratings for the Wilderness Area (WA),
Clark Fork (CF), and Exxon Valdez Oil Spill (EV) studies

Question                                                WA    CF    EV

 1.  Fatal flaws?                                       No    No    No
 2.  True value defined?                                 3     3     4
 3.  Attributes identified?                              2     7    10
 4.  Effects documented and communicated?                2     6    10
 5.  Aware of environmental and other substitutes?       3     4     5
 6.  Budget constraint stressed?                         4     4     5
 7.  Context specified and incentive compatible?         3     3     4
 8.  WTP elicited?                                       5     5     5
 9.  Scenario accepted?                                  2     4     4
10.  Scenario believed?                                  2     3     4
11.  Other questions?                                    3     4     5
12.  Survey mode appropriate?                            2     6    10
13.  Potential flaws identified?                         3     4     5
14.  Survey procedures?                                  4     4     5
15.  Econometrics                                        5     4     4
16.  Written materials adequate?                         4     4     5
17.  Other issues                                       10     8    10

Total Points                                            57    73    95

What  about  construct  validity?   I  would  rate  CF  and  EV  as  roughly
equivalent  in  terms  of  their  overall  construct  validity.   That  is,  both  are
toward  the  lower  end  of  Level  3.   Unfortunately,  EV  was  also  done  before  the
advanced  construct  validity  tests  began  to  be  taken  so  seriously.   Hence,  it 
has  no  between-sample  scope  test.   While  it  did  not  fail  such  a  test,  neither
did  it  pass  one.   EV  does  have  a  very  strong  valuation  equation.   An  indirect,
within-sample  scope  test  of  sorts  can  be  conducted  by  examining  this  equation
(Carson  et  al.  1992,  p.  5-108).   One  set  of  dummy  variables  represents
respondents'  views  about  the  potential  seriousness  of  oil  spills  if  the  escort
ship  program,  which  served  as  the  intervention,  were  not  adopted.   These 
variables  can  be  interpreted  as  respondents'  perceptions  of  the  scale  of  the 
injury  without  the  program.   Those  who  felt  that  the  injuries  would  be  a  great 
deal  more  or  somewhat  more  than  was  caused  by  the  Exxon  Valdez  spill  were 
willing  to  pay  significantly  more  for  the  program,  while  those  who  felt  that 
the  injuries  would  be  less  or  that  there  would  be  no  injuries  were  willing  to 
pay  less.   Likewise,  those  who  felt  that  the  escort  ship  program  would  not  be 
very  effective  in  preventing  injuries  or  would  not  be  at  all  effective  were 
willing  to  pay  less  on  average  than  those  who  thought  the  program  would  work. 
Not  only  are  all  of  these  relationships  significant  at  the  10  percent  level, 
but  they  have  the  correct  relative  magnitudes.   This  is  a  somewhat  stronger 
within-sample  scope  test  than  that  in  CF  because  it  does  not  involve  direct 
comparisons  of  scenarios  but  respondents'  perceptions  regarding  a  single 
scenario.   Thus,  it  avoids  the  fear  that  multiple  scenarios  will  lead 
respondents  to  give  answers  in  accordance  with  expectations  of  the
researchers.

WA,  on  the  other  hand,  appears  to  have  much  lower  construct  validity. 
Valuation  equations  presented  in  McFadden  (1994)  show  mixed  results. 
Potential  explanatory  variables  are  sometimes  significant  predictors  of 
willingness  to  pay,  but  at  other  times  are  not.   McFadden  tested  the 
hypothesis  that  income  had  the  same  effects  across  several  treatments  and 
could  not  reject  it,  a  positive  result.   However,  scope  tests  were  soundly 
failed,  as  were  other  tests  based  on  theory  (McFadden  1994;  Diamond  et  al.
1992).   I  would  assign  WA  to  the  lower  end  of  Level  2,  implying  that  its
results  might  be  useful  for  scientific  purposes,  but  would  need  to  be  applied
only  with  heavy  caveats,  if  at  all,  in  policy  analysis  and  damage  assessments. 


COMMENTS  ON  THE  SCIENTIFIC  STATUS  OF  CONTINGENT  VALUATION 

The  courts  are  often  asked  to  evaluate  the  merits  of  expert  scientific 
testimony.   Paraphrased  to  simplify  a  bit,  the  following  four  guidelines  have
been  suggested  (113  S.Ct.  2786  (1993)):

1.  Does  the  method  generate  hypotheses  and  can  these  hypotheses  be 
tested? 

2.  Has  the  method  been  subject  to  peer  review  and  publication? 

3.  Is  the  rate  of  error  known  and  are  there  standards  for  controlling 
the  technique's  operation? 

4.  Has  the  method  obtained  a  general  acceptance  among  a  relevant
scientific  community? 

Of  course,  I  am  in  no  position  to  express  opinions  on  legal  issues. 
Nevertheless,  I  thought  it  might  be  helpful  to  give  one  economist's  answer  to 
these  questions  as  they  apply  to  contingent  valuation. 

On  the  first  question,  there  seems  to  be  no  doubt.   The  goal  of 
construct  validity  assessment  is  to  test  hypotheses  about  relationships 
between  results  from  contingent  valuation  studies  and  expectations  from 
economic  theory.   Theory-driven  hypothesis  testing  is  central  to  contingent 
valuation. 

Regarding peer review and publication, a number of contingent valuation
studies have appeared in prominent, peer-reviewed economic journals including
the American Economic Review, the Quarterly Journal of Economics, the American
Journal of Agricultural Economics, Land Economics, and the Journal of
Environmental Economics and Management.

Turning  to  the  third  question,  errors  can  exist  on  two  levels. 
Statistical  error  (the  error  associated  with  sampling)  can  certainly  be 
evaluated  for  contingent  values  and  confidence  intervals  around  estimates  of 
mean  values  can  be  derived  from  the  same  statistics,  as  can  estimates  of  the 
standard  deviation  of  contingent  values  for  populations  of  individuals.   In 
this sense, error levels for contingent values are as quantifiable as those
for any other economic estimates. On a more difficult level, the direction
and  magnitude  of  the  bias  (defined  as  the  observed  value  minus  the  true  value) 
in  any  given  value  estimate  based  on  contingent  valuation  is  not  known  since 
true  values  are  not  observable.   However,  this  is  true  for  all  economic  value 
estimates.   As  to  whether  the  rate  of  error  can  be  controlled,  errors  can  be 
minimized  by  designing  and  executing  studies  with  high  content  validity,  a 
point  of  view  that  I  share  with  the  NOAA  Panel  and  the  writers  of  the  proposed 
NOAA  and  DOI  rules.   Furthermore,  construct  validity  tests  provide  important 
clues  about  which  studies  are  more  likely  to  be  accurate  and  which  are  more 
likely  to  contain  large  errors.   Criterion  validity  studies  should  eventually 
provide  further  insights  about  the  error  levels  in  contingent  valuation 
studies,  but  as  noted  earlier,  the  evidence  so  far  is  mixed. 
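
The statistical (sampling) side of the third question can be illustrated with a short calculation. The following is a minimal sketch, assuming a simple random sample and a normal approximation for the interval; the function name, the response values, and the multiplier z = 1.96 are illustrative assumptions and are not taken from any study discussed here. Note that the interval speaks only to sampling error and says nothing about bias.

```python
import math
import statistics


def mean_wtp_interval(wtp_values, z=1.96):
    """Return (mean, standard_error, lower, upper) for a sample of WTP responses.

    Uses a normal approximation; z = 1.96 corresponds to roughly 95 percent.
    """
    n = len(wtp_values)
    mean = statistics.fmean(wtp_values)
    sd = statistics.stdev(wtp_values)   # sample standard deviation
    se = sd / math.sqrt(n)              # standard error of the mean
    return mean, se, mean - z * se, mean + z * se


# Hypothetical responses in dollars, for illustration only.
responses = [0, 5, 10, 10, 20, 25, 40, 50, 75, 100]
mean, se, lo, hi = mean_wtp_interval(responses)
print(f"mean = {mean:.2f}, approx. 95% interval = ({lo:.2f}, {hi:.2f})")
```

With a sample this small the normal approximation is rough; it is used here only to keep the sketch short.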

Regarding  whether  contingent  valuation  is  generally  accepted,  it  is  true 
that  the  method  remains  a  subject  of  intense  debate.   Prior  to  the  Exxon 
Valdez oil spill, I think it is fair to say that the method was making
substantial inroads toward acceptance among specialists in environmental and
resource  economics.   While  a  few  were  extremely  skeptical,  many  were  quite 
interested  in  developing  and  applying  the  method.   Summer  meetings  of  the 
American  Agricultural  Economics  Association  and  winter  meetings  of  the  Allied 
Social  Science  Association  devoted  sessions  to  recent  papers  on  the  topic  each 
year.   These  sessions  were  almost  invariably  well  attended.   A  large  group  of 
researchers  from  Agricultural  Experiment  Stations  across  the  country  formed 
Western  Regional  Committee  W-133,  which  held  meetings  annually  focused  on 
nonmarket  valuation,  attracting  not  only  its  members  but  dozens  of  researchers 
and  government  economists.   Many  of  the  papers  presented  there  focused  on  the 
contingent  valuation  method.   The  result  of  all  this  effort  was  increasing 
acceptance  of  the  method  by  academic  resource  economists  and  federal  agencies, 
who  saw  the  technique  as  a  promising  new  approach  to  measuring  previously 
unquantifiable benefits and costs. The U.S. Environmental Protection Agency;
Department  of  the  Interior  agencies  including  the  Bureau  of  Reclamation,  the 
Fish  and  Wildlife  Service,  and  the  Bureau  of  Land  Management;  the  U.S.  Army 
Corps of Engineers; and U.S. Department of Agriculture agencies such as
the Forest Service and the Soil Conservation Service all authorized use of
contingent valuation for policy analysis. Scholarly interest spread beyond
the  U.S.   Studies  have  now  been  completed  in  Canada,  Australia,  New  Zealand, 
the  Republic  of  China,  Great  Britain,  Germany,  Norway,  Italy  and  other 
countries.

Following  the  Valdez  spill,  and  largely  because  of  the  proposed  use  of 
the  method  in  natural  resource  damage  assessment  cases,  prominent  mainstream 
economists  took  an  increasing  interest  in  the  reliability  of  the  method.   This 
has  had  the  benefit  of  focussing  the  contingent  valuation  debate  on  issues  of 
reliability.   Some  have  staked  out  extreme  positions  on  both  sides  of  the 
issue,  but  most,  including  the  economists  on  the  NOAA  Panel,  have  focused 
attention  on  the  determinants  of  where  the  method  will  work  well  and  where  it 
will not. This new focus seems to be moving toward identifying specific
actions within the control of the researcher that will enhance the validity of
the results and toward interpreting evidence from theory-driven hypothesis
testing. Interest in research on the topic is
growing  rapidly,  as  is  evidenced  by  a  recently  announced  program  of  support 
sponsored  jointly  by  the  Environmental  Protection  Agency  and  the  National 
Science  Foundation  to  explore  various  issues  in  environmental  valuation, 
including  contingent  valuation.   This  in  itself  is  significant  evidence  that 
many  at  high  levels  of  academia  and  government  continue  to  consider  the  method 
useful and promising. The validity of contingent valuation remains the
focus  of  heated  debate.   However,  among  those  most  knowledgeable  about  the 
technique,  there  is  substantial  support. 


CONCLUSIONS 

This  final  section  synthesizes  my  review  in  order  to  reach  a  conclusion 
about the validity of the Clark Fork study. Let me begin by reemphasizing that it
is  important  to  interpret  properly  the  concerns  raised  during  content  validity 
assessment.   Such  assessments  involve  a  search  for  possible  pitfalls  in  the 
procedures  used.   For  most  of  the  issues  that  arise  during  content  validity 
assessments, it is difficult to be sure whether or not a serious problem is
actually present. For example, this was the case when I expressed doubts
about  how  effective  the  Clark  Fork  study  was  in  communicating  the  potential 
effects  of  complete  and  partial  cleanup.   In  addition  to  survey  communication, 
relatively  serious  questions  arose  about  attribute  identification  procedures, 
choice  of  survey  mode,  the  believability  of  the  scenarios,  and  other  issues. 
Less  serious  concerns  were  raised  in  a  number  of  other  areas,  as  well.   For 
example,  I  allowed  only  4  points  out  of  5  for  econometrics.   This  was  not  to 
indicate  that  I  found  anything  wrong,  but  simply  that  more  might  have  been 
done.   In  terms  of  content  validity,  the  conclusion  is  that  the  study  is 
strong  in  many  respects,  but  could  have  been  better  designed  and  executed  in  a 
number of ways.

Since  the  content  validity  assessment  identified  some  possible 
difficulties,  I  entered  the  construct  validity  phase  of  the  assessment  with 
some  concerns  about  how  valid  the  final  results  would  be.   However,  the 
outcome  of  the  construct  validity  assessment  was  more  reassuring.   Valuation 
equations  and  success  in  the  within-sample  scope  tests  indicated  that 
respondents  dealt  sufficiently  well  with  the  information  provided  them  to 
answer  in  ways  that  economic  theory  would  predict.   This  supports  interpreting 
the  contingent  values  as  valid  estimates  of  the  true  values.   While  a  number 
of  potential  concerns  about  study  procedures  arose  during  content  validity 
assessment,  the  construct  validity  assessment  indicated  that  the  study 
procedures  worked  rather  well. 

Above  this  rather  positive  conclusion  is  one  dark  cloud,  the  failure  of 
the  Clark  Fork  study  to  pass  the  between-sample  scope  test  for  the  full 
sample.   However,  my  conclusion  is  that  it  would  be  easy  to  make  too  much  of 
this  one  concern.   Several  rather  innocuous  reasons  for  this  failing  have  been 
considered.   While  I  would  feel  a  little  more  confident  about  the  Clark  Fork 
study  if  it  had  passed  the  between-sample  scope  test  for  the  full  sample  with 
flying  colors,  it  would  be  a  mistake  to  allow  this  one  anomaly  to  overshadow 
its  many  positive  results.   Furthermore,  when  comparing  subsample  values,  the 
Clark  Fork  study  successfully  passes  a  between-sample  scope  test  at  the  11 
percent level. While it would have been a stronger result if the significance
level had been 10 percent or less, this positive evidence regarding scope was
reassuring. 
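
To make concrete what a between-sample scope test asks, here is a minimal sketch that compares two independent samples and reports a one-sided significance level. It uses a permutation test, which is a stand-in chosen for simplicity and is not necessarily the procedure used in the Clark Fork study; the function name and the data are hypothetical.

```python
import random
import statistics


def scope_test_p_value(partial, complete, n_perm=10000, seed=0):
    """One-sided permutation p-value for mean(complete) > mean(partial).

    A small p-value is evidence that the sample valuing the larger good
    (complete cleanup) states higher values than the sample valuing the
    smaller good (partial cleanup).
    """
    rng = random.Random(seed)
    observed = statistics.fmean(complete) - statistics.fmean(partial)
    pooled = list(partial) + list(complete)
    k = len(partial)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # reassign responses to the two groups at random
        diff = statistics.fmean(pooled[k:]) - statistics.fmean(pooled[:k])
        if diff >= observed:
            count += 1
    # Add-one correction keeps the estimate strictly positive.
    return (count + 1) / (n_perm + 1)


# Hypothetical stated values in dollars, for illustration only.
partial_sample = [0, 5, 10, 15, 20, 30, 40, 60]
complete_sample = [0, 10, 15, 25, 30, 45, 60, 90]
p = scope_test_p_value(partial_sample, complete_sample)
print(f"one-sided p-value = {p:.3f}")
```

Under a conventional 10 percent threshold, a p-value of 0.11 would narrowly miss, which is the situation described above for the subsample comparison.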

Comparing  the  Clark  Fork  study  to  two  other  well-known  studies  was 
designed  to  place  it  in  a  broader  perspective.   The  Clark  Fork  study  compared 
well  to  the  leading-edge  (and  very  expensive)  Exxon  Valdez  study  and,  in  my 
opinion,  was  clearly  superior  to  the  wilderness  area  study,  a  study  that  led 
to  a  major  new  publication  on  the  contingent  valuation  method.   This  supports 
the  conclusion  that,  though  it  may  not  be  quite  as  strong  as  the  Exxon  Valdez 
study,  the  Clark  Fork  study  should  command  a  relatively  high  level  of 
confidence.

In  my  opinion,  then,  results  from  the  Clark  Fork  study  have  sufficient 
validity  to  be  used  in  measuring  the  true  values  of  Montana  residents  for 
partial and complete cleanup at the Clark Fork sites.


In keeping with the title assigned to us, this paper will focus on
measurement.   We  will  devote  far  less  time  to  the  theory  of  non-use  values  in  a 
welfare  theoretical  framework  than  many  readers  might  expect.   Non-use  values 
are  now  well  entrenched  in  the  theory  of  the  consumer.   We  shall  want  to 
briefly  review  the  theory  as  a  foundation  for  what  will  follow,  but  need  not 
dwell  on  it  at  length.   The  central  focus  will  instead  be  on  empirical 
measurement.

To  date,  contingent  valuation  (CV)  is  the  only  tool  for  estimating  non-use 
values  that  has  a  substantial  following  among  researchers.   As  most  readers  are 
no  doubt  aware,  CV  is  currently  the  subject  of  a  raging  debate  in  the  U.S. 
Respected  researchers  currently  disagree  about  whether  CV  can  produce 
sufficiently  accurate  values  to  support  damage  assessment  and  policy  analysis. 
This  debate  may  have  already  begun  in  other  countries  and  will  almost  certainly 
intensify  there  as  applications  of  CV  expand.   The  premise  of  this  paper  is 
that  progress  in  this  debate  is  impeded  by  the  lack,  within  economics,  of  a 
widely  accepted  theory  of  measurement. 

Mitchell and Carson (1989), drawing on the psychological theory of testing
(Bohrnstedt 1983), have laid a foundation for such a theory. This paper will
attempt  to  elaborate  on  their  work  with  specific  reference  to  CV  methods  for 
estimating non-use values. In psychology, such abstract concepts as
intelligence have long been the subject of research. Tests, such as IQ tests,
are a major tool in psychometrics. We economists today face problems that are
in  many  ways  comparable  to  the  problems  that  pioneers  in  psychology  must  have 
faced.   We  have  the  abstract  concepts  of  willingness  to  pay  (WTP)  and 
willingness  to  accept  (WTA)  that  we  hope  to  measure.   WTP  and  WTA,  like 
intelligence,  are  fundamentally  abstract,  theoretically  defined  concepts  that 
are  not  directly  and  fully  observable  in  the  real  world.   Both  economists  and 
psychologists  must  try  to  infer  something  about  the  magnitudes  of  their 
constructs  through  analyses  of  data  that  are  thought  to  be  indicative  of  those 
magnitudes. In this endeavor, both must rely, implicitly or explicitly, on a
theory  of  measurement  to  guide  research  design  and  the  interpretation  of 
results.   In  fact,  one  might  reinterpret  CV  questions  as  "WTP  tests"  analogous 
to  IQ  and  other  psychological  tests.   Thus,  an  economic  theory  of  measurement 
has  much  to  learn  from  the  psychometric  theory  of  measurement. 

We  begin  by  defining  theoretical  WTP,  a  concept  that  will  play  the  role  in 
our  theory  of  measurement  that  the  concept  of  "true  value"  sometimes  plays  in 
psychometrics.   That  is,  the  validity  of  observed  WTP  can  only  be  evaluated 
with  reference  to  inherently  unobservable  theoretical  WTP  values.   To  address 
the  thorny  relationships  between  observable  and  unobservable  values,  we  will 
consider  the  dual  criteria  of  reliability  and  validity  and  adapt  these  concepts 
to reflect the goals of CV studies. We then turn to a triad of concepts that
might  be  termed  the  "Three  C's":   content  validity,  construct  validity,  and 
criterion  validity.   These  concepts  represent  mutually  reinforcing  strategies 
for assessing the validity of CV studies. Much of the paper focuses on how these
strategies  can  be  applied  to  assess  the  validity  of  individual  CV  studies.   We 
will  also  argue  that  validity  assessment  at  the  level  of  the  individual  study 
is  a  prerequisite  for  trying  to  draw  empirical  conclusions  about  the  validity 
of  the  CV  method  as  a  whole. 

Theoretical  WTP 

Let us work with WTP, always bearing in mind that WTA is a second measure
of economic value with equal theoretical standing. To define the theoretical
WTP,  let  us  consider  the  effects  of  an  "intervention"  in  the  economy  on  the 
economic  welfare  of  the  idealized,  utility-maximizing  consumer  of  economic 
theory.   Such  an  intervention  may  take  the  form  of  proposed  governmental 
policies,  projects,  or  regulations  that  will  in  some  way  affect  the  natural 
environment.2   Or,  the  intervention  might  be  a  release  of  oil  or  a  toxic 
substance  that  affects  the  environment.   In  either  case,  the  intervention  will 
affect  economic  parameters  such  as  prices  paid  or  income  received  by  the 
consumer  or  environmental  parameters  that  may  be  relevant  to  the  idealized 
consumer  for  a  variety  of  reasons.   Some  environmental  parameters  may  be 
relevant  for  uses  of  the  environment  by  the  consumer.   For  example,  the 
intervention  may  affect  fish  populations  where  the  consumer  goes  fishing. 
Other  parameters  may  be  relevant  to  the  consumer  for  reasons  other  than 
personal  use.   For  example,  the  consumer  may  be  altruistic  toward  others  who 
are  users  or  toward  animals  or  toward  future  generations.   It  is  through 
effects on environmental parameters that interventions generate environmental
"values."

Let  us  represent  the  consumer's  indirect  utility  function  by 

v(P,d,Y) 

where P symbolizes a vector of market prices, d symbolizes the status of the
environmental resource that would be affected by the intervention, and Y
symbolizes income. Let P0, d0, and Y0 represent the levels of these parameters
in the absence of the intervention and Pw, dw, and Yw represent the levels of the
parameters if the intervention is completed. Let WTPt represent the theoretical
WTP associated with the intervention. If the intervention has a positive
effect on the consumer, that is,

v(Pw,dw,Yw) > v(P0,d0,Y0),

then theoretical willingness to pay is defined implicitly by

v(Pw,dw,Yw-WTPt) = v(P0,d0,Y0).


2 We shall limit the discussion to interventions that affect the environment,
although everything said is directly applicable to other types of interventions.


On  the  other  hand,  if  the  intervention  will  make  the  consumer  worse  off,  that 
is, 

v(Pw,dw,Yw) < v(P0,d0,Y0),

then the implicit definition becomes

v(Pw,dw,Yw) = v(P0,d0,Y0-WTPt).

Where the intervention leads to an improvement in welfare, WTPt is most
naturally  defined  as  Hicksian  compensating  surplus,  while  it  should  be 
interpreted  as  Hicksian  equivalent  surplus  if  welfare  will  decline.3 
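
As a worked illustration of the implicit definitions, the sketch below solves for theoretical WTP numerically. Everything specific in it is an assumption made for illustration: prices are held fixed and dropped from the notation, the indirect utility function is taken to be ln(Y) + a ln(d) with a = 0.1, and the function names and numbers are hypothetical.

```python
import math


def v(d, y, a=0.1):
    """Hypothetical indirect utility with prices held fixed: ln(y) + a*ln(d)."""
    return math.log(y) + a * math.log(d)


def theoretical_wtp(d0, y0, dw, yw, tol=1e-10):
    """Solve the implicit definition of theoretical WTP by bisection."""
    u0, uw = v(d0, y0), v(dw, yw)
    if uw == u0:
        return 0.0
    if uw > u0:
        # Improvement: find WTP such that v(dw, yw - WTP) = v(d0, y0).
        def gap(w):
            return v(dw, yw - w) - u0
        upper = yw
    else:
        # Decline: find WTP such that v(dw, yw) = v(d0, y0 - WTP).
        def gap(w):
            return v(d0, y0 - w) - uw
        upper = y0
    # gap is positive at zero and falls as the payment rises toward income.
    lo, hi = 0.0, upper * (1 - 1e-12)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if gap(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Environmental quality doubles while income stays at 30,000.
print(round(theoretical_wtp(d0=1.0, y0=30000.0, dw=2.0, yw=30000.0), 2))
```

For this particular functional form, and with income unchanged, the improvement case also has the closed form WTPt = Y(1 - (d0/dw)^a), which provides a check on the numerical solution.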

WTPt as just defined must be viewed as a theoretical construct, a useful
scientific  fiction  describing  what  would  be  measured  under  ideal  circumstances. 
We  economists  use  the  WTPt  construct  with  such  ease  and  confidence  that  it  takes 
on  a  reality  of  its  own  in  our  minds,  though  it  does  not  exist  in  reality.   The 
utility  maximization  framework  is  a  highly  stylized  version  of  how  real  human 
beings  think  and  behave.   Consider,  for  example,  how  we  conceptualize  the 
process  by  which  utility  is  maximized.   Do  we  really  believe  that  human  beings 
sit  down  at  one  point  in  time  and  plan  what  they  intend  to  consume  during  the 
next  period  of  time?   What  unit  of  time  are  we  thinking  of,  a  month,  a  year,  a 
lifetime?   Is  the  "utility  counter"  set  at  zero  on  January  1  of  each  year? 
Human  choice  must,  in  reality,  be  a  much  more  dynamic,  complex  process  than  the 
simple  theory  allows. 

3 WTA can be defined in parallel fashion. If the intervention improves welfare,
then WTA represents the minimum compensation that the citizen would accept in lieu
of the intervention. If the intervention reduces welfare, then WTA is the minimum
compensation that would have to be paid to the citizen before he or she would find
the intervention, after compensation, acceptable.


The very concept of "consumer" involves a gross abstraction from reality.
Do we mean that consumption decisions are made by individual human beings
acting in isolation from those around them? This does not seem to be in
keeping with reality, where groups of individuals (couples, families, unrelated
individuals  who  share  dwellings)  make  joint  decisions,  often  subject  to 
constraints  that  involve  multiple  incomes.   Thus,  it  is  not  unusual  to  find 
discussions  of  consumption  behavior  that  refer  to  the  consuming  unit  as  a 
"household."   However,  more  rigorous  treatments  of  the  subject  refer  only  to 
individuals  as  consuming  units  because,  once  group  decision  processes  are 
permitted,  basic  assumptions  about  preferences  such  as  completeness, 
transitivity,  and  continuity  are  called  into  question.   These  assumptions 
underlie  the  utility  functions  needed  to  define  WTPt  (and  theoretical  WTA)  and 
to  develop  hypotheses  about  the  relationships  between  values  and  other  economic 
parameters.   To  the  extent  that  the  theoretical  concept  of  "the  consumer"  has 
no full counterpart in the real world, neither does WTPt. WTPt is in principle
unobservable  because  it  does  not  exist  in  reality. 

Why  deal  in  such  abstractions  at  all?   People  are  willing  to  pay  some 
amounts,  but  not  other  amounts,  for  conventional  goods  and  services.   It  is  not 
much  of  a  leap  to  infer  that  real  people  might  also  be  willing  to  pay  something 
to  obtain  environmental  amenities  and  avoid  environmental  "bads."   Why  not  stop 
there,  defining  theoretical  WTP  as  the  maximum  willingness  to  pay  of  real 
people?   The  answer  is  that  such  an  approach  would  not  be  very  rich 
theoretically  speaking.   It  would  not  readily  yield  testable  hypotheses  about 
how  willingness  to  pay,  thus  defined,  might  be  related  to  the  other  economic 
parameters  confronted  by  these  same  real  world  people.   As  soon  as  researchers 
try  to  consider  such  potential  relationships,  however,  they  would  immediately 
be  overwhelmed  by  the  complexities  and  ambiguities  that  are  inherent  in  the 
behavior  of  real  people.   Such  complexities  and  ambiguities  only  become 
tractable  through  abstraction.   Whether  one  wishes  to  consider  the  simplest 
relationships  (e.g.,  the  relationship  between  WTP  and  income)  or  more  complex 
ones  (e.g.,  the  relationships  between  WTP  for  two  or  more  commodities  or  WTP 
for  public  goods)  abstract  "modelling"  of  economic  behavior  is  necessary.   The 
usefulness  of  consumer  theory  and  welfare  economics  lies  in  the  guidance  they 
provide in designing empirical studies and interpreting results. Gains of this
kind are only realized, however, by assuming away potentially important parts
of  reality. 

Market  data  have  been  central  to  efforts  to  quantify  economic  values. 
This  is  the  so-called  revealed  preference  approach  to  value  estimation.   People 
spend  money  on  goods  and  services,  and  such  spending  has  great  credibility  as 
evidence of the values people place on goods and services.

Without quarreling at all with the credence that economists place on
market data, we should like to emphasize that market values are not direct
observations on WTPt. This is the case for both conceptual and empirical
reasons.   From  a  conceptual  perspective,  to  admit  that  economic  theory  is 
abstract  is  to  admit  that  it  may  not  and  probably  will  not  fully  represent 
decision  processes  of  real  world  consumers  in  the  marketplace.   People  making 
market  choices  will  hopefully  engage  in  processes  and  apply  criteria  like  those 
depicted  in  theory.   Nevertheless,  to  say  that  theory  is  abstract  is  to  say 
that  what  we  shall  term  "other  factors,"  factors  not  considered  in  the  theory, 
may  also  impinge  on  consumer  choices.   Other  factors  could  include,  for 
example,  the  influences  of  group  decision  making  within  the  household  and 
strategies  devised  by  real  human  beings  to  cope  with  uncertainty.   To  the 
extent  that  other  factors  enter  into  market  choices,  market  values  may  diverge 
from WTPt.4


4 Care must be taken in this sort of discussion to avoid making theory into
a straw man. Economists interested in household decision making, decision making
under uncertainty, and other economic problems do develop much richer theoretical
models than those typically drawn upon in conceptualizing WTPt for purposes of
applied welfare analysis. Perhaps richer theories will eventually yield new
insights and testable hypotheses for valuation studies, but they have yet to do
so. To date, the simpler models, which intentionally leave numerous potentially
important other factors out of the analysis, have dominated discussions of WTPt.


In addition to theoretical concerns about interpreting market values as
WTPt, it must also be remembered that empirical estimation of demand and supply
relationships inevitably involves simplifying assumptions, errors in variables,
model specification errors, missing data, extrapolations beyond the range of
the data, and other potential sources of error. Benefit-cost analysis almost
invariably involves extrapolations of values estimated based on past behavior
into the future, a process that is far from perfect. Thus, even if real world
consumers behaved exactly like the consumer of economic theory, econometric
errors would make estimated market values imperfect measures of WTPt.

Votes  for  or  against  referenda  involving  commitments  to  pay  taxes  or  other 
fees might also provide credible evidence for estimating WTPt, although
questions  do  arise  about  such  an  interpretation.   It  is  conceivable,  for 
example,  that  people  who  themselves  would  favor  a  proposition  (indicating  that 
their WTP values exceed the monetary commitments) might still vote against
it  if,  for  example,  they  felt  that  it  was  unfair  to  impose  the  tax  on  others. 
To  call  referenda  "political  markets"  is  perhaps  stretching  the  concept  of 
markets  a  bit.   Nevertheless,  the  fact  that  voting  for  such  referenda  involves 
real  commitments  to  pay  gives  voting  considerable  clout  as  evidence  about 
economic  values.   Referenda  often  involve  public  goods,  making  them 
particularly  relevant  in  considering  non-use  values,  since  the  existence  of 
environmental  resources  may  be  near-pure  public  goods. 

This  is  the  arena  in  which  CV  is  attempting  to  gain  acceptance  as  a  valid 
source of evidence about WTPt. Whereas traditionally only actual commitments of
money  in  markets  and  perhaps  in  referenda  have  been  accepted  as  evidence  of 
WTPt,  now  CV  researchers  are  suggesting  that  survey  responses  to  questions  about 
economic  values  be  accepted  as  valid  evidence  about  those  values.   Not 
surprisingly,  there  has  been  considerable  skepticism  about  whether  CV  can 
produce  valid  measures  of  true  values.   The  skepticism  of  economists  is  based 
on more than mere conservatism regarding acceptable evidence. Economic theory
leads  them  to  expect  that  people  will  find  it  in  their  self-interest  to 
intentionally  state  misleading  values  where  public  goods  such  as  environmental 
quality  are  to  be  valued,  especially  if  real  commitments  of  money  are  not 
required.   Psychologists  have  come  to  CV  with  their  own  reasons  for  skepticism, 
thinking  that  people  may  have  trouble  constructing  accurate  estimates  of  WTPt  in 
response  to  survey  questions.   Without  more  experience,  respondents  may  have  an 
inadequate  basis  for  evaluating  their  values  even  if  they  want  to  reveal  them. 

Skepticism about new methods of measurement is healthy in science, but
progress comes from empirical work and not simply speculating about possible
problems.   The  accuracy  of  any  valuation  method  must  be  assessed  in  terms  of 
how  close  it  comes  to  measuring  the  ideal.   If  WTPt  were  observable,  then  there 
would  be  no  problem.   One  would  simply  observe  it.   Given  that  WTPt  is  not 
observable,  more  complex  criteria  and  "rules  of  evidence"  are  needed  to  assess 
accuracy.   Continuing  to  borrow  from  psychology,  accuracy  in  measurement 
depends  on  the  reliability  and  validity  of  the  data. 

Reliability  and  Validity 

These concepts are clearly laid out by Mitchell and Carson (1989, pp. 120-125).
A measurement technique is unreliable if the random error in the data it
generates reaches unacceptable levels. Stated differently, measures are more
reliable  the  less  "noisy"  are  the  data.   Validity  involves  systematic  errors  in 
measurement. As Mitchell and Carson (1989, p. 190) have pointed out, "The
validity  of  a  measure  is  the  degree  to  which  it  measures  the  theoretical 
construct  under  investigation."   In  the  current  context,  the  "theoretical 
construct under investigation" is WTPt. A CV measure (or a measure derived
using another method) is "biased," and hence invalid, if it tends to depart in
a systematic way from WTPt.

The  economist  is  interested  in  reliability  and  validity  on  a  different 
level from the psychologist. When psychologists use these terms they are
primarily  interested  in  accuracy  at  the  level  of  the  individual  human  being. 
Think  about  IQ  testing,  for  example.   While  psychologists  interested  in 
intelligence  may  have  some  interest  in  the  average  IQ  of  aggregates  of  people 
such  as  all  four  year  olds,  they  are  primarily  concerned  about  measuring 
intelligence  of  individuals.   Economists  seeking  to  evaluate  changes  in 
economic  welfare,  on  the  other  hand,  are  interested  in  the  accuracy  of  averages 
and  aggregates.   A  considerable  amount  of  unreliability  in  observed  WTP  is 
tolerable,  provided  bias  is  non-existent  or  at  least  within  tolerable  bounds. 
Accurate estimates of mean WTPt can be made in the face of unreliability by
simply increasing sample size.5 Those doing research on the validity of CV and
other  valuation  methods  are  concerned  primarily  about  the  validity  of  average 
observed  WTP  over  appropriately  defined  aggregates  of  individuals. 
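
The distinction between unreliability and bias can be illustrated with a small simulation. This is a minimal sketch under strong simplifying assumptions: every individual is given the same theoretical WTP, the random error is normally distributed, and the bias is a constant added to every response. The function name and all numbers are hypothetical.

```python
import random
import statistics


def simulate_mean_error(n, noise_sd, bias, true_mean=50.0, seed=0):
    """Return (sample mean of observed WTP) minus (true mean WTP).

    Each observed value is the true value plus a constant bias plus random
    noise, so noise_sd controls unreliability and bias controls invalidity.
    """
    rng = random.Random(seed)
    observed = [true_mean + bias + rng.gauss(0.0, noise_sd) for _ in range(n)]
    return statistics.fmean(observed) - true_mean


for n in (25, 400, 6400, 102400):
    noisy_only = simulate_mean_error(n, noise_sd=40.0, bias=0.0)
    noisy_and_biased = simulate_mean_error(n, noise_sd=40.0, bias=10.0)
    print(f"n = {n:6d}  error, noise only: {noisy_only:7.2f}  "
          f"error, noise plus bias: {noisy_and_biased:7.2f}")
```

As the sample grows, the error from noise alone tends toward zero, while the error in the biased case settles near the size of the bias. This is why a larger sample can offset unreliability but cannot repair a systematic departure from WTPt.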

Current  practice  in  CV  requires  that  we  raise  one  other  issue  relating  to 
the  concept  of  validity.   Application  of  CV,  like  any  empirical  undertaking, 
involves  numerous  issues  that  must  be  resolved  based  on  the  judgement  of  the 
researcher. In damage assessments and to a substantial degree in benefit-cost
applications  as  well,  researchers  today  will  often  resolve  these  issues  by 
choosing  the  "conservative"  course.   That  is,  in  exercising  judgement,  they 
choose  courses  of  action  that,  if  anything,  will  lead  to  lower  estimates  for 
mean WTPt. If, as some fear, CV studies tend to overestimate mean WTPt, then
conservatism,  at  least  within  limits,  may  pull  estimated  values  toward  the 
theoretical  value.   However,  if,  as  others  believe,  CV  values  tend  to  be 
accurate  measures  of  mean  WTPt,  conservatism  must  be  viewed  as  a  practical 
departure  from  normal  scientific  practice.   To  arrive  at  estimates  of  average 
WTPt  that  are  more  easily  defended  in  court  or  the  policy  arena,  researchers 
make  conservative  choices  in  study  design  and  execution,  thus  intentionally 
introducing possible sources of downward bias. The target is no longer
WTPt, but an underestimate of it. Whatever the practical merits of conservatism,
the ultimate goal of CV studies is still to estimate aggregate WTPt for the
population  in  question.   Conservatism  in  measurement  must  be  viewed  as  an 
attempt  to  arrive  at  lower  bounds  for  the  desired  values.   In  the  long-run, 
refinements  will  hopefully  lead  to  estimates  that  approach  mean  and  aggregate 
WTPt from below.

Because  mean  WTPt  is  in  principle  unobservable,  inferences  about  validity 
must  be  based  on  indirect  evidence  rather  than  direct  comparisons  of  estimated 
average  WTPt  to  true  values.   Indirect  evidence  may  relate  to  the  content 
validity,  the  construct  validity,  or  the  criterion  validity  of  a  measure. 


5   Of  course,  unreliability  in  observing  values  at  the  individual  level  can 
lead  to  biases  in  the  estimation  of  higher  moments  for  the  aggregate. 
Content Validity

Content  validity  has  to  do  with  whether  the  design  and  execution  of  the 
study are conducive to the revelation of WTPt. In other words, assessing
content validity involves examining the "content" of the study design and
execution. For CV studies, assessing content validity involves four steps.
First, the study design must be compared to the economic theory underlying WTPt.
Second,  the  extent  to  which  the  study  communicates  effectively  to  the  relevant 
population  must  be  evaluated.   These  first  two  steps  might  be  summarized  by 
saying  that  a  valid  CV  study  must  deal  with  both  Homo  economicus  and  Homo 
sapiens  in  ways  that  support  WTPt  estimation.   Third,  whether  various  facets  of 
study  execution  were  adequate  must  be  considered.   Fourth,  the  econometrics 
used  to  estimate  mean  and  aggregate  WTPt  and  other  statistics  are  examined.   We 
now  consider  each  of  these  aspects  of  content  validity  in  more  detail. 

In dealing with Homo economicus, the scenario of a valid study provides
the context and information that would lead a theoretical consumer to reveal
his or her WTPt. Fischoff and Furby (1988) provide a useful framework for
developing  a  CV  scenario  with  content  validity.   (Also  see  Bishop,  Champ  and 
Mullarkey  1994).   The  premise  of  Fischoff  and  Furby's  framework  is  that  a  CV 
exercise  should  fulfill  the  requirements  of  a  satisfactory  transaction.   They 
define  a  satisfactory  transaction  as  "involving  individuals  who  are  fully 
informed,  uncoerced,  and  able  to  identify  their  own  best  interests"   (Fischoff 
and Furby 1988, 148). Three aspects of the transaction must be adequately
defined  and  understood  by  participants:   the  good,  the  payment  and  the 
marketplace.   All  three  can  affect  the  value  an  individual  places  on  a  good. 

The  good.      Defining  the  good  to  be  valued  in  the  CV  question  can  be  very 
difficult.   "Goods  may  be  thought  of  as  bundles  of  attributes,  representing 
outcomes  of  accepting  the  transaction  that  might  be  valued  either  positively  or 
negatively"  (Fischoff  and  Furby  1988,  153).   The  researcher  must  determine 
which  attributes  affect  the  value  an  individual  places  on  a  good,  a 
particularly  difficult  task  if  the  individual  is  not  familiar  with  the  good 
prior  to  receiving  the  survey.   Fischoff  and  Furby  also  mention  the  importance 
of specifying the reference and target levels of the good. Survey respondents
need  to  know  the  level  of  provision  of  the  environmental  attributes  "without" 
and  "with"  the  intervention. 

The  payment.       In  most  CV  studies  the  hypothetical  payment  is  made  in 
dollars.   However,  as  use  of  CV  expands  to  other  contexts,  alternative  forms  of 
payment such as labor hours are also being used (Swallow and Mulato 1994). The
direction  of  payment  must  also  be  made  clear.   Will  the  individual  pay  for  the 
good  or  will  the  individual  be  compensated?   Another  important  aspect  of  the 
payment  is  the  mechanism  through  which  the  payment  will  be  made.   Commonly  used 
mechanisms  include  income  taxes,  property  taxes,  sales  taxes,  entrance  fees, 
changes  in  market  prices  of  goods  and  services,  and  payments  to  special  funds. 

The  marketplace   or  context   for  valuation.      The  third  aspect  of  the  CV 
scenario  that  must  be  specified  is  what  Fischoff  and  Furby  term  the 
"marketplace."   Since  a  market  per  se,  as  that  term  is  normally  understood, 
need  not  be  involved,  we  prefer  to  think  in  terms  of  the  "context"  in  which  the 
transaction  is  to  take  place.   The  researcher  must  define  the  extent  of  the 
market (who the other potential market participants are), how and when the
environmental amenity might be provided, and the decision rule for the
provision (e.g., majority vote, individual payment, etc.).

Beginning  with  the  very  early  CV  studies,  economists  have  feared  that 
strategic  responses  would  badly  distort  value  estimates.   Even  though  empirical 
studies  have  shown  that  strategic  behavior  need  not  have  a  large  impact  on  CV 
results,  there  is  no  reason  to  invite  trouble  in  this  regard.   Most  researchers 
would  agree  that  it  is  desirable  to  use  an  incentive  compatible  context  in  CV 
questions.

Whether one takes the theoretical perspective of Homo economicus or the
practical perspective of how to deal effectively with Homo sapiens, information
on the good, the payment, and the context is central. Homo sapiens has
certain additional needs that must be attended to. It is not difficult to
imagine a study that covers all the theoretical aspects well but fails to
effectively communicate the terms of the proposed "transaction" to the individuals
whose values are to be estimated. Thus, Mitchell and Carson (1989, 192)
recommend asking the following questions when assessing the content validity of
a CV scenario:

Does  the  description  of  the  good  and  how  it  is  to  be  paid  for  appear  to  be 
unambiguous?   Is  it  likely  to  be  meaningful  to  the  respondents?   Is  there 
anything  in  the  scenario  that  might  suggest  to  some  respondents  that  the 
good  would  not  be  paid  for?   Are  the  property  right  and  the  market  for  the 
good  defined  in  such  a  way  that  the  respondents  will  accept  the  WTP  format 
as  plausible?   Does  the  scenario  appear  to  force  reluctant  respondents  to 
come  up  with  WTP  amounts? 

CV  researchers  are  increasingly  recognizing  that  dealing  with  Homo  sapiens 
is  not  a  trivial  problem.   The  result  is  increasing  emphasis  on  "qualitative 
research"  as  final  CV  surveys  are  prepared.   Focus  groups,  verbal  protocols, 
observed  one-on-one  interviews,  pretests,  and  pilot  studies  can  be  used  to 
determine  how  information  and  questions  can  be  most  understandably  presented. 

While  including  qualitative  research  procedures  in  CV  studies  is 
commendable,  their  limitations  in  establishing  the  content  validity  of  CV 
studies must be recognized. Focus groups, verbal protocols, observed
one-on-one interviews, and other such procedures involve small samples which may not
be  representative.   Furthermore,  standard  methods  for  conducting  such 
procedures  and  for  reporting  results  do  not  exist,  at  least  at  present. 
Typically,  the  outsider  who  is  attempting  to  assess  the  content  validity  of  a 
completed  study  is  left  with  little  more  than  a  statement  to  the  effect  that  a 
certain  number  of  focus  groups  were  conducted.   There  is  always  the  danger  that 
such  exercises  were  conducted  in  such  a  way  that  relevant  points  of  confusion 
and  misunderstandings  did  not  surface.   Research  to  standardize  procedures  may 
be  helpful.   Pretests  and  pilot  tests  are  more  "quantitative"  and  hence  more 
amenable  to  standardized  procedures,  but  may  not  help  identify  more  fundamental 
design  flaws.   For  example,  responses  to  pretests  and  pilots  may  not  be 
adequate  to  identify  ways  in  which  respondents  misunderstood  the  CV  scenario. 
Inviting and carefully studying verbatim comments on the survey from
respondents, as well as including follow-up questions designed to probe for
understanding, may be helpful.

Clearly,  qualitative  research  to  enhance  content  validity  is  a  very 
important part of any CV study, particularly studies of non-use values, where
respondents  may  lack  intimate  familiarity  with  the  environmental  resources 
being  valued.   Much  more  research  is  needed  to  build  consensus  on  proper 
procedures  for  conducting  this  phase  of  instrument  development  and  reporting 
results. 

Finally,  a  study  that  is  inadequate  in  its  econometrics  would  be  of 
questionable  content  validity.   This  aspect  is  noted  for  completeness,  but 
little  more  needs  to  be  said  here.   Economists  have  the  training  to  do  well  in 
this  regard.   To  the  extent  that  CV  studies  involve  competent  applications  of 
econometric  methods,  they  will  have  higher  content  validity,  all  else  equal. 

In  the  end,  content  validity  cannot  be  proven  in  some  objective  sense. 
Rather,  it  is  up  to  the  researcher  to  demonstrate  to  his  or  her  peers  that  the 
study  is  designed  and  executed  to  be  as  conducive  as  possible  to  revelation  of 
WTPt.   To  be  sure,  some  evidence  can  be  accumulated  that  may  support  the  content 
validity  of  a  study.   Questions  can  be  added  to  the  survey  to  generate  data 
relevant  to  the  assessment  of  content  validity,  for  example.   Results  from 
focus  groups  and  other  qualitative  procedures  can  be  called  upon  for  support  as 
well. Content validity, nevertheless, is ultimately a matter of professional
judgment.

Actual  CV  studies  will  display  differing  degrees  of  content  validity. 
That  a  given  study  has  some  apparent  flaws  in  this  regard  will  not  normally  be 
grounds  for  totally  dismissing  its  results.   Still,  the  more  such  flaws  are 
identified  and  the  more  serious  they  are  judged  to  be  by  peer  reviewers,  the 
less credible are the estimates of value and the more tenuous and tentative are any
policy  and  methodological  conclusions  derived  from  them.   This  is  a  point  which 
has  not  been  adequately  recognized  in  the  current  literature.   Instead,  it  is 
not  unusual  for  authors  to  attempt  to  draw  rather  sweeping  conclusions  from 
results  that  many  would  judge  as  having  limited  content  validity.   While  our 
primary  purpose  is  to  consider  broad  principles  for  assessing  the  validity  of 
CV  studies  and  not  to  give  detailed  reviews  of  individual  studies,  a  couple  of 
examples  of  studies  from  the  current  literature  may  help  to  clarify  the  sorts 
of problems we are concerned about.

Much  has  been  made  of  the  so-called  embedding  problem  based  primarily  on 
the paper by Kahneman and Knetsch (1992). Drawing upon studies reported in
their  paper,  Kahneman  and  Knetsch  (1992,  58)  concluded  that  embedding  effects 
are  "perhaps  the  most  serious  shortcoming  of  CVM  [the  contingent  valuation 
method]  ..."   Yet  we  have  serious  concerns  about  the  content  validity  of  the 
studies  on  which  this  conclusion  is  based.   Consider  the  study  involving 
valuation  of  improvements  in  the  environment,  disaster  preparedness,  and 
medical personnel and equipment. At all three levels of embeddedness, study
participants  were  asked  to  value  vaguely  defined  products  under  vaguely  defined 
conditions.   The  reference  levels  of  services  were  not  defined  at  all  and  the 
target  levels  were  merely  described  as  "improvements."   Nothing  was  said  about 
which  particular  attributes  would  be  improved,  about  the  physical  locations  of 
the  changes,  about  the  timing  of  the  changes,  or  other  aspects.   Instead  of 
specific  details  about  the  terms  of  the  proposed  transaction,  respondents  were 
left  with  vague  references  to  taxes,  prices,  and  user  fees  to  be  placed  in  some 
undefined  "specific  fund."   Given  such  weaknesses  in  content  validity,  how  can 
generalizations  about  the  shortcomings  of  CV  be  made? 

Or,  consider  a  second  study.   Based  on  several  different  CV  treatments 
involving logging of wilderness areas in the western U.S., Diamond et al.
(1992,  15)  concluded  that,  in  general,  "whatever  contingent  valuation  surveys 
are  measuring,  they  are  not  measuring  consumers'  preferences  for  environmental 
amenities."   Does  their  study  have  sufficient  content  validity  to  support  such 
a  far-reaching  conclusion?   Respondents  were  told  that  seven  or  eight  or  nine 
wilderness  areas,  depending  on  the  treatment,  would  be  logged  somewhere,  but 
the  location  was  not  mentioned  except  that  at  least  one  would  be  located 
somewhere  in  the  respondent's  home  state  (Colorado,  Wyoming,  Montana,  or 
Idaho).   Via  telephone  interviews,  respondents  were  asked  to  value  one  or  more 
specified  wilderness  areas,  again  depending  on  the  treatment,  but  these  areas 
were  only  roughly  described  in  terms  of  size  and  location,  and  little  was  said 
about  how  and  when  the  logging  would  be  conducted  or  what  effects  it  might  have 
on environmental parameters. Given all the vagueness in the information
provided,  respondents  may  be  forgiven  for  expressing  values  that  were  not 
consistent  with  prior  expectations  of  the  researchers,  based  on  economic 
theory.   It  is  not  surprising,  for  example,  that  respondents  missed  the 
subtleties  of  whether  seven  or  eight  or  nine  other  wilderness  areas  were 
already  slated  for  logging.   Furthermore,  most  CV  researchers  give  a  lot  of 
attention  to  the  realism  of  the  payment  vehicle;   the  vehicle  of  a  federal 
income  tax  surcharge  to  save  one  or  a  few  specific  wilderness  areas  would  be 
unprecedented in the past experience of respondents. In addition, most CV
researchers would shy away from a telephone survey for a study involving an
issue of this complexity. No qualitative research results to allay such
concerns were reported. If one agrees that the content validity of the Diamond
et al. study is highly questionable, then surely sweeping conclusions about the
efficacy of the CV method as a whole based on the study results are equally
questionable.

Construct  Validity 

Construct  validity  deals  with  the  degree  to  which  the  measure  under 
scrutiny  (in  our  case  observed  WTP)  relates  to  other  measures  as  predicted  by 
theory.   Mitchell  and  Carson  (1989)  discuss  two  forms  of  construct  validity  -- 
convergent  and  theoretical.   Tests  of  convergent  validity  consider  the 
relationship  between  the  CV  measure  of  the  good's  value  and  alternative 
measures  of  the  good's  value.   For  example,  convergent  validity  could  be 
assessed  by  comparing  values  estimated  from  a  CV  model  to  values  estimated 
using a travel cost model or a hedonic pricing model. At least at present,
opportunities  for  convergent  validity  tests  of  non-use  values  are  extremely 
limited.   CV  is  widely  considered  the  only  method  capable  of  estimating  such 
values.   Hence,  we  will  not  deal  with  convergent  validity  further  here. 

Theoretical  validity  is  often  assessed  by  considering  the  relationship 
between  the  CV  measure  and  independent  variables  that  are  thought  to  be 
potential determinants of WTPt. Assessing the theoretical validity of a measure
may involve simple contingency table analyses. Alternatively, more sophisticated
multivariate regression procedures may be applied, with coefficients on potentially
important independent variables scrutinized for statistical significance,
appropriate signs, and relative magnitudes.
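As a rough illustration of the regression approach just described, the following sketch fits a one-variable valuation equation and reports the sign and t-statistic of the income coefficient. It is a minimal sketch under stated assumptions, not a procedure drawn from any study reviewed here: the function name simple_ols, the simulated incomes, and the simulated WTP responses are all hypothetical, and a real application would use several regressors and an estimator suited to the elicitation format.

```python
import math
import random
import statistics


def simple_ols(x, y):
    """Fit y = a + b*x by ordinary least squares.

    Returns (slope, intercept, t_statistic_for_slope).
    """
    n = len(x)
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    residual_variance = sum(r * r for r in residuals) / (n - 2)
    standard_error = math.sqrt(residual_variance / sxx)
    return slope, intercept, slope / standard_error


# Hypothetical data: income in thousands of dollars, stated WTP in dollars.
rng = random.Random(0)
income = [rng.uniform(20, 100) for _ in range(400)]
wtp = [max(0.0, 5 + 0.4 * inc + rng.gauss(0, 15)) for inc in income]

slope, intercept, t_stat = simple_ols(income, wtp)
print(f"income coefficient = {slope:.3f}, t = {t_stat:.2f}")
```

A positive and statistically significant income coefficient would count as evidence in favor of theoretical validity; an insignificant one would not by itself condemn the study, since some amenities may simply be insensitive to income.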

Diamond et al. (1992), among others, have recently advocated a different
form  of  theoretical  validity  test.   They  advocate  testing  CV  responses  against 
hypotheses  based  on  neoclassical  consumer  theory.   For  example,  one  would 
expect  WTPt  to  be  larger  the  more  of  an  environmental  amenity  is  provided  or  the 
larger  is  the  environmental  insult  that  is  avoided.   CV  estimates  of  values 
should,  if  they  are  measuring  true  values,  exhibit  relative  magnitudes 
consistent  with  this  hypothesis.   Such  tests  of  hypotheses  about  expected 
variations  in  estimated  values  with  respect  to  changes  in  the  scope  of 
environmental  improvements  or  insults  have  come  to  be  known  as  "scope  tests." 
Within  the  taxonomy  being  followed  here,  scope  tests  are  theoretical  validity 
tests.   Theory  also  tells  us  that  values  should  be  sensitive  to  the 
availability  of  complements  and  substitutes.   Hence,  one  might  construct  CV 
exercises  to  test  hypotheses  about  the  effects  on  expressed  values  of  different 
levels  of  availability  of  complements  and  substitutes.   Or,  transitivity  is  one 
of  the  core  assumptions  about  preferences.   In  a  split  sample  design,  if  Sample 
1  places  a  higher  value  on  intervention  A  than  intervention  B  and  Sample  2 
values  intervention  B  more  highly  than  intervention  C,  then  one  would 
hypothesize  that  a  third  sample  ought  to  value  intervention  A  more  highly  than 
intervention  C.   If  not,  then  the  validity  of  the  CV  study  in  question  would  be 
more  doubtful  on  theoretical  grounds. 
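A split-sample scope test of the kind described above can be sketched as a one-sided comparison of mean WTP across two samples. This is only an illustration under stated assumptions: the function names, the response values, and the use of a Welch-type statistic with a large-sample critical value of 1.645 are choices made for this sketch, not features of any particular study.

```python
import math
import statistics


def welch_t(sample_small_scope, sample_large_scope):
    """Welch-type t statistic for (mean of large scope) - (mean of small scope)."""
    mean_small = statistics.fmean(sample_small_scope)
    mean_large = statistics.fmean(sample_large_scope)
    var_small = statistics.variance(sample_small_scope)
    var_large = statistics.variance(sample_large_scope)
    standard_error = math.sqrt(
        var_small / len(sample_small_scope) + var_large / len(sample_large_scope)
    )
    return (mean_large - mean_small) / standard_error


def passes_scope_test(sample_small_scope, sample_large_scope, critical=1.645):
    """One-sided test: is mean WTP larger for the larger intervention?

    The default critical value is the large-sample 5 percent one-sided
    cutoff; with small samples a t-table value would be more appropriate.
    """
    return welch_t(sample_small_scope, sample_large_scope) > critical


# Hypothetical WTP responses (dollars) from two randomly assigned samples.
partial_cleanup = [10, 25, 0, 40, 15, 30, 20, 5]
complete_cleanup = [30, 45, 20, 60, 35, 50, 25, 55]

print(passes_scope_test(partial_cleanup, complete_cleanup))
```

The same comparison could be repeated pairwise across three samples to examine the transitivity hypothesis, although that extension is not shown here.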

A  fundamental  question  follows:   What  if  the  results  from  a  CV  application 
fail  to  pass  a  theoretical  validity  test?   Suppose,  for  example,  that  a 
valuation  equation  fails  to  produce  a  positive  relationship  between  observed 
WTP  and  income  of  respondents.   Should  the  whole  study  be  thrown  out  as 
invalid?   Surely  not.   Failure  to  get  a  positive  coefficient  on  income  could 
even  be  consistent  with  theory.   The  value  of  the  environmental  amenity  in 
question may simply be insensitive to income.

Now  suppose  that  the  results  fail  a  scope  test  or  some  other  test  based  on 
consumer  theory.   At  a  minimum,  failure  of  a  scope  test  would  call  for  even 
more  careful  scrutiny  of  the  content  validity  of  the  study.   One  would 
certainly  want  to  ask  whether  the  qualitative  phases  of  the  research  succeeded 
in identifying dimensions of the environmental amenity in question that matter
to potential respondents. The matter of whether the survey instrument
communicates well would need to be pursued with extra vigor. Careful
attention  to  statistical  issues  would  also  be  warranted.   Values  are  often 
inherently  highly  variable,  and  failure  of  a  scope  test  might  simply  reflect 
the  fact  that  the  effects  of  scope  got  lost  in  the  noisiness  of  the  data.   In 
the end, CV studies that fail scope tests will have less credibility than those
that  pass. 
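The point about noisy data can be made concrete with a small simulation. The sketch below assumes, purely for illustration, that true mean WTP is 50 dollars for the smaller intervention and 60 dollars for the larger one, that individual responses are normally distributed, and that each sample has 100 respondents. None of these figures comes from an actual study, and the function name is hypothetical.

```python
import math
import random
import statistics


def scope_test_pass_rate(true_small, true_large, sd, n,
                         trials=500, seed=1, critical=1.645):
    """Share of simulated split-sample studies that pass a one-sided scope test.

    Each trial draws n responses per sample around the assumed true means
    and applies a Welch-type comparison of the two sample means.
    """
    rng = random.Random(seed)
    passes = 0
    for _ in range(trials):
        small = [rng.gauss(true_small, sd) for _ in range(n)]
        large = [rng.gauss(true_large, sd) for _ in range(n)]
        standard_error = math.sqrt(
            statistics.variance(small) / n + statistics.variance(large) / n
        )
        t_stat = (statistics.fmean(large) - statistics.fmean(small)) / standard_error
        if t_stat > critical:
            passes += 1
    return passes / trials


# Same true scope effect in both cases; only the response variability differs.
noisy = scope_test_pass_rate(50, 60, sd=100, n=100)
quiet = scope_test_pass_rate(50, 60, sd=20, n=100)
print(f"pass rate with noisy responses: {noisy:.2f}")
print(f"pass rate with less noisy responses: {quiet:.2f}")
```

Under these assumptions, the scope effect is real in every simulated study, yet most of the noisy studies fail the test. A failed scope test is therefore weaker evidence against a study when response variance is high relative to sample size.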

Failure  of  more  complex  tests  based  on  economic  theory  may  be  more 
difficult  to  interpret.   Suppose  efforts  fail  to  show  that  values  are  sensitive 
to  availability  of  complements  and  substitutes  or  that  intransitivity  is 
present  in  the  results.   Again  this  may  indicate  flaws  in  study  design  or 
statistical  noise.   Or,  more  fundamental  problems  may  be  indicated.   As  we  have 
already  noted,  the  economic  theory  on  which  such  tests  are  based  is  highly 
abstract  and  stylized.   In  other  words,  reality  is  some  unknown  mixture  of  what 
theory  models  and  "other  things."   Failure  of  responses  to  a  CV  survey  to 
conform  to  theory  may  indicate  that  "other  things"  (phenomena  not  considered  in 
the  theory)  are  having  a  strong  influence.   What  these  "other  things"  are  and 
whether  they  are  somehow  biasing  results  will  usually  be  very  unclear. 

Thus,  the  overall  conclusion  would  be  that  a  CV  study  that  displays  strong 
theoretical  validity  ought  to  be  considered  superior  to  those  that  display 
weaknesses  in  these  respects  or  do  not  include  theoretical  validity  tests  in 
their  study  design  at  all.   This  is  so  because  studies  with  strong  theoretical 
validity  are  more  likely  to  be  indicative  of  true  values  as  defined  in  theory. 
Suspected  weaknesses  identified  during  theoretical  validity  testing  may 
indicate  flaws  in  study  design  that  were  not  detected  when  the  content  validity 
of the study was assessed. Or, they may arise because unknown other factors
outside the theory used to define WTPt are influencing results.

This  line  of  reasoning  leads  to  another  question:   At  what  point  should 
failures  of  theoretical  validity  tests  be  considered  "fatal?"   That  is,  when  is 
the  point  in  theoretical  validity  testing  reached  where  the  study  is  simply 
rejected?   This  is  not  a  question  that  is  easily  dealt  with.    Studies  vary 
greatly  in  their  objectives  and  the  budgets  provided  to  accomplish  those 
objectives. Further, the current state of the art in CV does not include
well-developed, widely agreed upon approaches to theoretical validity testing.

As  a  starting  point,  we  propose  two  kinds  of  theoretical  validity  tests, 
which  we  shall  term  rudimentary  and  advanced  tests.   Rudimentary  tests  should 
be  feasible  using  data  from  a  single  survey  of  a  single  sample.   Through 
regression  analysis,  contingency  tables  involving  correlations,  or  other  such 
procedures,  the  rudimentary  tests  would  explore  the  theoretical  validity  of  the 
results.   Variables  in  the  analysis  would  normally  include  income, 
socioeconomic characteristics, self-reported past behavior6, and attitudinal
measures. A study that lacks the money and/or time to gather information on
such variables for a single sample probably ought to be considered infeasible.
More advanced construct validity tests are those involving more costly
split-sample designs that will support testing hypotheses on scope, variations in
substitutes  and  complements,  and  the  other  hypotheses  based  on  theory. 

6  Having previously visited a site that is the subject of a non-use value
study, for example, might affect WTP. Membership in environmental organizations,
participation in recreational activities related to the environmental resources
in question, past volunteer activities, and other such variables could also be
considered in such analyses.

We propose that studies be categorized into a three-level hierarchy
expressing increasing degrees of construct validity. At the lowest level would
be studies that either have not included any construct validity tests or have
failed to pass rudimentary tests. Such studies might typically have had low
budgets and/or severe time constraints and this may have limited the amount of
qualitative research that could be conducted, thus limiting the content
validity of the study as well. Such studies may still be useful for scientific
purposes or as exercises involving training of students, but should be used in
policy  analysis  and  litigation  only  with  the  heaviest  caveats.   The  second 
level  of  the  hierarchy  would  involve  studies  that  have  achieved  a  fair  amount 
of  success  in  the  rudimentary  tests,  but  that  either  do  not  have  the  budget  to 
support  advanced  testing  or  have  not  succeeded  in  passing  advanced  tests. 
Second-level  studies  may  be  usable  in  cost-benefit  analyses,  since  normally 
such  analyses  are  simply  interested  in  determining  whether  the  benefits  of  an 
intervention  exceed  the  costs.   Of  course,  suitable  caveats  would  need  to  be 
introduced  into  such  studies.   Unless  benefits  exceed  costs  by  a  fairly  wide 
margin or vice versa, potential biases in second-level studies may mean that
the issue of whether benefits exceed costs remains open. Second-level studies
may be less useful for litigation, where relatively precise estimates of value
are needed to assess damages, but they may still be useful in preliminary
damage assessments and for such purposes as "grossly disproportionate tests."7
Third-level studies are studies that have conducted and achieved substantial
success  in  sophisticated  rudimentary  tests  and/or  have  conducted  and  passed 
advanced  tests.   Provided  that  such  studies  are  judged  to  have  a  high  degree  of 
content  validity  as  well,  they  would  have  the  highest  level  of  credibility  for 
benefit-cost  analysis  and  litigation. 
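The three-level hierarchy can be restated as a simple decision rule. The sketch below is one possible reading, offered only to make the categories explicit. The names Outcome and construct_validity_level are hypothetical, and two cases the text leaves open are resolved by assumption: a study that fails rudimentary tests is placed at the first level whatever else it reports, and a study that fails advanced tests without having run rudimentary tests is also placed at the first level.

```python
from enum import Enum


class Outcome(Enum):
    NOT_CONDUCTED = "not conducted"
    FAILED = "failed"
    PASSED = "passed"


def construct_validity_level(rudimentary, advanced, strong_rudimentary=False):
    """Return 1, 2, or 3 for the proposed construct validity hierarchy.

    strong_rudimentary flags "substantial success in sophisticated
    rudimentary tests", which the text treats as sufficient for level 3.
    """
    if rudimentary is Outcome.FAILED:
        return 1  # assumption: a rudimentary failure dominates
    if advanced is Outcome.PASSED:
        return 3
    if rudimentary is Outcome.PASSED:
        return 3 if strong_rudimentary else 2
    return 1  # no construct validity test conducted or passed


print(construct_validity_level(Outcome.PASSED, Outcome.NOT_CONDUCTED))
```

Any such rule says nothing about content validity, which must still be judged separately before a third-level study is given the highest credibility.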

The  other  issue  needing  attention  in  this  section  is  the  extent  to  which 
success  or  failure  in  construct  validity  tests  implies  that  CV  itself  is  a 
valid  or  invalid  procedure.   Notice  that  the  focus  is  changed  substantially. 
Up  to  now  we  have  been  concerned  about  the  validity  of  individual  applications. 
Now  the  ability  of  the  CV  method  in  general  to  estimate  true  values  is  the 
issue.   Can  construct  validity  testing  help  to  address  this  more  fundamental 
question? 

7  The primary goal of settlements in natural resource damage cases in the
U.S. is to rehabilitate and restore injured resources. However, responsible
parties need not carry out rehabilitation and restoration if the costs of doing
so are grossly disproportionate to the benefits to the public from such actions.
This makes it necessary to measure benefits at least approximately.

The difficulties created by our inability to observe true values become
painfully apparent at this point. While failure to pass construct validity
tests does, as we have just seen, create some degree of doubt about individual
studies,  the  problems  thus  identified  may  be  particular  to  the  study  in 
question.   Perhaps  design  flaws  in  the  individual  study  are  to  blame.   Such 
flaws  may  have  been  identified  when  content  validity  was  considered,  but  they 
may  also  have  been  hidden.   Perhaps  the  "other  things"  that  theory  ignores  are 
more  potent  or  play  themselves  out  in  some  unusual  way  in  the  study  in 
question.   For  these  reasons,  the  temptation  to  draw  sweeping  conclusions  about 
the  validity  of  the  CV  method  from  one  or  only  a  few  studies  should  be 
resisted.   Only  if  a  large  number  of  seemingly  high-quality  CV  studies  fail  to 
demonstrate  scope,  sensitivity  to  substitutes  and  complements,  or  other  such 
tests  should  the  validity  of  CV  itself  be  called  into  question.   Likewise,  if 
CV  studies  with  a  high  degree  of  content  validity  appear  to  pass  such  tests 
rather  consistently,  then  the  validity  of  the  method  would  be  supported.   In 
the  intermediate  case  where  CV  studies  sometimes  pass  such  tests  and  sometimes 
fail  them,  a  very  different  conclusion  would  be  warranted.   Assuming  that  the 
studies  in  question  have  a  high  degree  of  content  validity,  such  an  outcome 
would  imply  that  intervening  factors  are  sometimes,  but  not  always,  interfering 
in  the  successful  application  of  the  method.   Research  would  be  called  for  to 
identify  what  those  intervening  factors  are.   New  criteria  to  be  applied  in 
content  validity  assessment  would  follow.   Ultimately,  it  may  be  possible  to 
develop  criteria  for  judging  where  CV  can  be  applied  with  good  prospects  for 
success and where it cannot.

Criterion  Validity 

To  assess  criterion  validity,  Mitchell  and  Carson  (1989,  192)  point  out 
that  "it  is  necessary  to  have  in  hand  a  criterion  which  is  unequivocally  closer 
to  the  theoretical  construct  [WTPt  or  its  mean  value  over  some  population]  than 
the measure whose validity is being assessed [the CV-based measure value]."
The  closer  the  contingent  value  is  to  the  criterion,  the  more  valid  it  is 
judged  to  be. 

Given  the  credibility  that  preference  expressions  in  markets  have  as 
indicators of true values, actual market prices would be ideal criteria to use
in  evaluating  contingent  values;  however,  because  such  market  prices  are  rare 
in  the  environmental  area  and  especially  for  non-use  values,  so-called 
"simulated  market"  values  are  perhaps  a  more  promising  alternative  for  judging 
the  criterion  validity  of  contingent  values.   Simulated  markets  involve 
creating  situations  in  the  field  or  laboratory  where  subjects  have  the 
opportunity  to  actually  pay  for  the  good  or  service  or  receive  compensation  for 
it.   The  same  good  or  service  is  also  valued  by  use  of  the  CV  method. 

Simulated  markets  differ  from  real  markets  in  two  ways.   First,  each 
individual  may  be  involved  in  only  one  transaction.   Second,  the  mechanism  by 
which  the  price  is  determined  may  seem  somewhat  artificial  to  participants. 

Kealy,  Montgomery,  and  Dovidio  (1990)  argued  that,  "The  WTP  values 
measured  in  a  simulated  market  are  the  best  available  criterion  to  evaluate  the 
self-reports  of  WTP  from  the  corresponding  hypothetical  situation  posed  by  the 
contingent valuation method" (Kealy, Montgomery, and Dovidio 1990, 247).
Likewise, Bishop et al. (1984) say that "experimental studies are needed
because methodological cross-checks [i.e., evidence of construct validity]
usually do not produce values that can be used as exact standards for
comparison" (Bishop et al. 1984, 7).

Bohrnstedt  (1983)  provides  a  useful  distinction  between  two  kinds  of 
criterion  validity:   predictive  validity  and  concurrent  validity.   Applied  in 
the context of CV, predictive validity might be assessed by asking subjects a
CV question at one point in time and later giving the same individuals a chance to
actually purchase the same good in a simulated market exercise (see, for example,
Kealy et al. 1990). For concurrent validity, the measure and the criterion
against  which  it  is  to  be  assessed  are  measured  simultaneously.   This  could 
involve  a  split  sample  design  where  randomly  assigned  subjects  participate  in 
either  a  CV  exercise  or  a  simulated  market  where  the  same  good  can  actually  be 
bought  or  sold  (see,  for  example,  Bishop  and  Heberlein  1979).

Simulated  market  experiments  have  been  conducted  for  both  use  values  (Bohm 
1972;  Bishop  and  Heberlein  1979;  Dickie  et  al.  1987;  Coursey,  Hovis  and  Schulze
1987;  Kealy,  Montgomery,  and  Dovidio  1990;  Bishop,  Welsh,  and  Heberlein  1993;
Neill  et  al.  1994)  and  total  values  where  non-use  values  are  likely  to  be  a
substantial  component  (Boyce  et  al.  1989;  Kealy,  Montgomery  and  Dovidio  1990;
Duffield  and  Patterson  1992;  Seip  and  Strand  1992;  Champ  et  al.  1994).   It  is
tempting  to  launch  into  a  discussion  of  these  studies,  for  they  provide  many 
interesting  features  and  results.   Space  and  time  preclude  doing  justice  to 
such  an  exercise.   It  must  suffice  to  say  that  the  evidence  is  somewhat  mixed, 
with  several  studies  providing  support  for  the  criterion  validity  of  CV, 
especially  for  use  values.   Simulated  market  experiments  may  be  showing  an 
upward  bias  in  CV  estimates  of  total  values  (including  non-use  values),  but
such  a  conclusion  is  still  tentative  and  the  subject  of  a  continuing  debate. 
Potential  flaws  in  all  of  the  total  value  studies  raise  questions  about  their 
content  validity.   Where  such  concerns  exist,  doubts  arise  about  whether 
resulting  simulated  market  value  estimates  are  of  sufficient  quality  to  serve 
as  criteria  at  all.   Though  we  would  be  reluctant  to  go  so  far,  one  might 
succeed  in  arguing  that  a  simulated  market  of  sufficient  quality  to  provide  a 
criterion  validity  test  for  CV  has  yet  to  be  conducted.   Clearly,  the 
challenges  of  conducting  a  simulated  market  to  estimate  non-use  values  with  a 
high  degree  of  content  validity  are  formidable. 

An  alternative  with  some  promise  would  be  to  use  CV  to  predict  the  share 
of  the  population  that  would  vote  in  favor  of  propositions  in  actual  referenda. 
The  potential  usefulness  of  referenda  as  indicators  of  true  values  has  already 
been  discussed  here  and  the  advantages  of  the  referendum  format  for  CV  non-use 
value  studies  are  well  established  in  the  literature  (Mitchell  and  Carson  1989; 
Hoehn  and  Randall  1987).   At  least  one  study  using  this  approach  has  already
been  conducted  (Carson,  Hanemann,  and  Mitchell  1986).   Though  additional  such
studies  should  be  pursued,  their  disadvantage  relative  to  simulated  markets 
should  be  recognized.   True  values,  as  defined  here,  are  monetary  values. 
Voting  in  real  referenda  will  not  produce  direct  evidence  to  estimate  true 
values  that  can  serve  as  a  standard  of  comparison  for  contingent  values. 
Referenda  can  only  produce  as  criteria  the  percentages  of  citizens  voting  for 
and  against  ballot  propositions.   Econometric  models  of  CV  responses  would  then
be  used  to  predict  the  vote  based  on  survey  responses.   The  null  hypothesis  to 
be  tested  is  that  the  predicted  vote  equals  the  actual  vote.   Results  would 
obviously  be  relevant  to  understanding  the  validity  of  CV,  but  would  not  be 
fully  equivalent  to  comparing  monetary  values. 
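
As a rough illustration of the hypothesis test just described, the following Python sketch compares a CV-predicted "yes" share with the share actually observed in a referendum. It is a minimal sketch under simplifying assumptions that are ours and not part of the studies cited: the predicted share is treated as a simple sample proportion from n survey respondents, the actual vote share is treated as a known constant, and a normal approximation is used. The function name and all numbers are hypothetical.

```python
import math


def vote_share_z_test(predicted_yes, n_respondents, actual_share):
    """Two-sided z-test of H0: the CV-predicted yes share equals the actual share.

    Assumes (for illustration only) that the prediction is a raw sample
    proportion and that the referendum result is known without error.
    """
    p_hat = predicted_yes / n_respondents
    # Standard error of a sample proportion under the null hypothesis.
    se = math.sqrt(actual_share * (1.0 - actual_share) / n_respondents)
    z = (p_hat - actual_share) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return p_hat, z, p_value


# Hypothetical example: 330 of 500 respondents say yes; the actual vote was 60%.
share, z_stat, p_val = vote_share_z_test(330, 500, 0.60)
print(f"predicted share={share:.3f}, z={z_stat:.2f}, p={p_val:.4f}")
```

In an actual application the predicted share would come from an econometric model of the CV responses, so its standard error would generally differ from the simple binomial form assumed here.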

Criterion  validity  tests  hold  both  great  promise  and  limitations  in  the 
debate  over  the  CV  method.   The  point  here  can  be  summarized  by  saying  that, 
for  purposes  of  evaluating  the  validity  of  any  particular  CV  application, 
criterion  validity  tests  are  external  tests.   This  is  a  terse  way  of  pointing 
out  that  the  main  usefulness  of  criterion  validity  tests  is  in  the  overall 
evaluation  of  CV  as  a  method  of  valuation  and  not  in  evaluating  the  validity  of 
individual  studies,  except  indirectly.   If  a  simulated  market  with  reasonably 
high  content  validity  were  possible  under  the  conditions  present  in  most  CV 
applications,  there  would  be  no  need  for  CV.   Rather,  researchers  must  seek 
very  special  field  conditions  or  highly  artificial  laboratory  situations  to 
conduct  simulated  market-CV  comparisons.   Conclusions  of  criterion  validity 
tests  will  help  assess  the  prospect  for  successfully  applying  CV  under  normal 
circumstances.   Positive  conclusions  from  simulated  market  studies  (and  referenda
studies,  for  that  matter)  would  provide  general  support  for  field  applications
of  CV.   They  might  also  provide  insights  into  desirable  design  features  of  CV 
studies,  thus  enhancing  the  criteria  to  be  applied  in  content  validity 
assessments.   Nevertheless,  success  in  simulated  market  experiments  would  not 
be  grounds  for  uncritical  acceptance  of  results  from  individual  CV 
applications.   The  validity  of  each  individual  application  would  still  need  to 
be  assessed  through  content  validity  assessment  and  construct  validity  testing. 

Summary 

The  debate  over  CV  is  a  debate  that  can  only  be  resolved  with  empirical 
evidence.   Too  often,  it  seems  to  us,  the  critics  of  CV  simply  speculate  about 
what  might  go  wrong  in  CV  studies  and  then  assume  that  those  things  actually  do 
go  wrong.   On  the  other  hand,  some  CV  practitioners  have  probably  been  far  too
willing  to  accept  the  results  of  their  studies  at  face  value.   Carrying  on  the
debate  at  this  level  will  continue  to  bear  little  fruit. 

In  order  to  consider  the  empirical  evidence  systematically  and 
objectively,  the  ground  rules  must  be  determined  in  advance.   What  is  needed  is 
a  theory  of  measurement  for  CV  that  can  guide  the  empirical  research.   To  that 
end,  we  have  attempted  in  this  paper  to  expand  on  a  theme  that  already  exists 
in  the  literature  on  CV.   The  approach  builds  on  the  theory  of  psychological 
testing.   It  is  based  on  the  premise  that  true  values  are  unobservable,  so  that 
validity  must  be  evaluated  indirectly,  through  drawing  of  inferences  from  a 
number  of  directions.   In  psychology,  reliability  must  be  carefully  considered 
along  with  validity.   If  the  data  are  too  noisy,  it  will  be  impossible  to 
detect  biases  in  observations  at  the  individual  level  even  if  the  noise  is 
simply  random.   Reliability  is  somewhat  important  for  CV  as  well,  but  it  takes 
on  less  importance  because,  in  economics,  the  goal  is  normally  to  measure 
values  for  a  population,  and  not  for  the  individual.   Individual  values  could 
be  quite  noisy,  yet  average  values  could  be  unbiased  and  accuracy  would  simply 
be  a  matter  of  sample  size. 
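
The point about noise and sample size can be made concrete with a small Python sketch. It assumes, purely for illustration, that each stated value equals the true mean plus independent zero-mean normal noise with standard deviation sigma; under that assumption the standard error of the sample mean is sigma divided by the square root of n. The function names and numbers are hypothetical.

```python
import math
import random


def standard_error(sigma, n):
    """Standard error of a sample mean when noise is independent with s.d. sigma."""
    return sigma / math.sqrt(n)


def simulated_sample_mean(true_mean, sigma, n, seed=0):
    """Average of n noisy individual values drawn around a common true mean."""
    rng = random.Random(seed)  # seeded so the illustration is repeatable
    draws = [rng.gauss(true_mean, sigma) for _ in range(n)]
    return sum(draws) / n


# Individual values are very noisy (s.d. 50 around a mean of 30), yet the
# standard error of the mean shrinks as the sample grows.
for n in (100, 400, 10000):
    print(n, standard_error(50.0, n), simulated_sample_mean(30.0, 50.0, n, seed=1))
```

Random noise of this kind affects only precision. A systematic bias in responses would not shrink with sample size, which is why validity remains the central concern.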

The  Three  C's--content  validity,  construct  validity,  and  criterion
validity--seem  to  us  to  fit  the  problem  of  evaluating  the  validity  of  CV  rather
well.   The  first  step  toward  achieving  a  valid  CV  study--if  that  is  possible--
is  to  design  and  execute  the  study  properly,  and  the  goal  of  content  validity
assessment  is  to  evaluate  the  study  procedures.   Since  the  goal  is  to  measure
the  true  value,  a  well-designed  study  will  show  a  close  correspondence  between
the  economic  theory  which  is  used  to  define  the  true  value  and  the  structure 
and  wording  of  the  CV  question.   Stated  differently,  everything  that 
neoclassical  consumers  would  need  to  formulate  and  reveal  their  true  values 
must  be  present  in  the  CV  exercise.   Furthermore,  since  normal  human  beings, 
and  not  theoretical  consumers,  will  be  asked  to  complete  the  survey,  care  must 
be  taken  to  design  the  survey  to  be  effective  in  communicating  with  real 
people.   It  must  cover  and  communicate  all  the  parameters  of  the  intervention 
that  are  of  concern  to  study  subjects.   Theory  offers  only  general  guidance  on 
this  topic.   There  are  numerous  hidden  pitfalls  here  that  may  require  a
substantial  amount  of  qualitative  research  to  overcome.   Furthermore,  a  study 
with  high  content  validity  will  have  carefully  executed  all  phases  of  the 
survey,  from  population  definition  and  sampling  through  coding  and  entry  of  the 
responses.   Finally,  a  study  with  high  content  validity  will  have  applied 
appropriate  econometric  procedures  to  arrive  at  values  and  related  statistics. 
Content  validity  is  a  matter  of  human  judgement.   The  process  of  establishing 
the  content  validity  of  a  CV  study  is  ultimately  one  of  convincing  one's  peers 
that  the  study  design  and  execution  had  no  major  flaws  that  interfered 
substantially  with  the  revelation  and  estimation  of  true  values.   One  of  the 
main  points  we  have  tried  to  make  in  this  paper  is  that  the  more  serious  are 
the  potential  flaws  in  study  design  and  execution,  the  more  tentative  must  be 
any  conclusions  drawn  for  policy,  litigation,  and  methodology. 

While  visible  flaws  that  may  have  contributed  to  biased  final  results  will 
be  identified  through  the  content  validity  assessment,  whether  such  biases 
actually  exist  and  how  large  they  might  be  will  not  normally  be  clear  from 
content  validity  assessments.   In  addition,  some  flaws  in  study  design  may  be 
hidden.   Construct  validity  tests  are  needed  to  help  identify  such  flaws.   For 
non-use  valuation  studies,  such  tests  will  mainly  involve  testing  hypotheses 
about  how  different  contingent  values  from  the  same  study  relate  to  each  other 
and  to  other  variables  measured  in  the  study.   Studies  that  fail  to  show 
expected  relationships  in  rudimentary  tests  will  have  the  least  credibility. 
At  the  other  end  of  the  spectrum  will  be  studies  that  pass  a  battery  of 
rudimentary  and  advanced  tests  of  theoretical  validity.   To  have  maximum 
credibility  in  benefit-cost  analyses  and  litigation,  a  study  must  be  toward  the 
high  end  of  this  hierarchy. 

It  is  entirely  possible,  of  course,  that  CV  is  incapable  of  producing 
valid  estimates  of  non-use  values  even  under  the  best  of  circumstances.   To 
some  extent,  theoretical  validity  testing  can  shed  light  on  this  more 
fundamental  issue.   The  more  often  studies  fail  the  theoretical  validity  tests, 
the  more  likely  it  is  that  CV  is  a  scientific  dead-end.   Likewise,  the  more 
often  such  tests  are  passed,  the  more  likely  it  is  that  CV  is  actually
providing  valid  information  about  true  values.   Because  individual  studies  vary 
widely  in  their  subject  matter  and  content  validity,  the  weight  of  evidence 
from  many  studies  will  be  required  in  order  to  judge  the  overall  validity  of 
CV.   The  temptation  to  reach  broad  conclusions  on  the  basis  of  one  or  a  few 
studies  should  be  resisted.   Studies  with  high  content  validity  should  be  taken 
more  seriously  than  those  with  low  content  validity. 

Criterion  validity  tests  may  be  capable  of  casting  further  light  on  the 
overall  validity  of  the  method.   Here  simulated  market  experiments--while  they
involve  significant  challenges  in  the  design  and  implementation--appear  to  be
the  most  promising  avenue.   Further  efforts  to  find  circumstances  where 
simulated  markets  are  feasible  could  produce  rich  dividends  in  judging  the 
potential  of  the  method.   Here  again,  individual  studies  are  unlikely  to  be 
definitive.   Also,  care  will  always  be  required  in  extrapolating  conclusions 
from  such  simulated  markets  to  field  applications  of  CV.   Content  and  construct 
validity  will  continue  to  play  a  central  role  in  evaluating  the  validity  of 
individual  CV  studies. 

Quoting  Mitchell  and  Carson  (1989,  190),  "The  validity  of  a  measure  is  the
degree  to  which  it  measures  the  construct  under  investigation."  In  applied 
welfare  economics,  the  construct  is  most  often  one  of  the  Hicksian  measures  of 
economic  value.  Assessing  the  accuracy  of  consumer  welfare  measures  is 
difficult  because  true  Hicksian  values  are  inherently  unobservable.  Hence
estimated  values  cannot  be  compared  directly  with  true  values  to  judge  the
performance  of  measurement  techniques  (Bishop  et  al.  1994).  This  is  the  case
whether  the  valuation  technique  in  question  is  contingent  valuation  (CV)  or  one
of  the  methods  that  attempt  to  infer  values  from  revealed-preference  data.
Hence,  less  direct  forms  of  evidence  about  the  validity  of  valuation  techniques 
are  required. 

The  raging  debate  over  CV  among  economists,  spawned  in  part  by  work 
surrounding  the  Exxon  Valdez  oil  spill  (Carson  et  al.  199x;  Hausman  199x),  is
a  debate  over  the  validity  of  the  method.  This  paper  rests  on  the  premise  that 
progress  in  this  debate  is  hampered  by  a  lack  of  consensus  among  economists 
regarding  criteria  for  judging  the  validity  of  welfare  estimates  in  general  and 
CV  in  particular. 

Mitchell  and  Carson  have  suggested  that  evidence  about  the  validity  of  a 
measure  may  relate  to  its  content  validity,  construct  validity,  or  criterion 
validity.  Each  of  these  approaches  "offers  a  different  strategy  for  assessing 
the  measure-construct  relationship,  and  each  is  applicable  to  contingent 
valuation  in  one  way  or  another"  (Mitchell  and  Carson  1989,  190).  This  paper
focuses  on  content  validity. 

Content  validity  assessment  involves  evaluation  of  study  design  and 
execution.1  Partly,  it  is  guided  by  theory.  Value  estimates  will  ultimately
be  interpreted  as  estimates  of  the  true  values  as  defined  in  theory.  From  a 
practical  standpoint,  this  means  that  a  survey  instrument  and  supporting 
materials  must  be  designed  in  ways  that  would  support  revelation  of  true  values 
by  the  consumer  of  economic  theory.  For  a  CV  study  to  be  fully  content  valid, 
respondents  must  have  incentives  for  true  value  revelation  and  enough 
information  to  make  utility  maximizing  choices.  Content  validity  assessment 
also  asks  whether  the  CV  study  procedures  were  designed  to  interact  effectively 
with  real  people.  It  is  not  hard  to  imagine  a  study  that  is  strongly  linked  to 
theory,  yet  fails  to  deal  effectively  with  potential  survey  respondents. 
Through  experience,  CV  researchers  have  learned  that  this  is  not  a  trivial 
problem.  Finally,  to  be  content  valid,  the  survey  and  subsequent  analysis  and 
presentation  of  results  must  be  adequately  executed.  Here,  attention  is  focused 
upon  such  topics  as  sampling,  response  rates,  and  econometric  procedures. 

Figure  1  will  serve  as  a  point  of  departure  for  our  study.  This  Figure 
is  a  proposed  tool  for  rating  the  content  validity  of  CV  studies.  Our  purpose 
in  proposing  the  rating  form  is  not  to  attempt  to  set  up  ourselves  or  anyone 
else  as  ultimate  authorities  on  the  content  validity  of  CV  studies.  Rather  we 
hope  simply  to  make  content  validity  assessment  more  systematic.  The  rating 
form  is  designed  to  help  CV  researchers  plan  and  conduct  better  studies  and 
reviewers,  in  their  various  roles2,  to  be  clearer  about  their  reasons  for 
judging  studies  to  be  strong  or  weak  in  terms  of  the  procedures  that  were 
followed.  The  rating  form  contains  a  checklist  of  considerations  that  should 
be  addressed  in  designing  and  executing  studies  with  high  content  validity.  We 
do  not  expect  this  list  of  questions  to  be  particularly  controversial.  When 
weights  are  assigned  to  the  different  dimensions,  more  objections  are  likely  to 


1  Our  definition  of  content  validity  assessment  is  significantly  broader 
than  that  of  Mitchell  and  Carson  (1989,  190-192).  They  focus  exclusively  on
examination  of  the  survey  instrument,  while  we  would  include  all  aspects  of 
study  design  and  execution. 

2  That  is,  those  who,  as  journal  reviewers,  consultants  to  decision
makers,  expert  witnesses,  or  in  other  such  roles,  are  called  upon  to  evaluate
the  merits  of  CV  studies.
arise,  but  Figure  1  is  amenable  to  whatever  weights  a  particular  researcher  or
reviewer  deems  appropriate. 

Much  of  this  paper  is  occupied  with  explaining  what  we  are  trying  to  get 
at  in  the  rating  sheet  and  why.  Each  of  the  questions  in  Figure  1  will  be 
explained  and  justified.  Toward  the  end  of  the  paper,  the  relationships  between 
content  validity,  construct  validity,  and  criterion  validity  will  be  considered. 
Obviously,  judgements  about  the  validity  of  any  study  will  depend  on  more  than 
its  content  validity.  Important  additional  evidence  will  come  from  subjecting 
study  results  to  hypothesis  testing  based  on  theoretical  expectations  (i.e., 
construct  validity  testing).  How  CV  has  fared  in  laboratory  and  field
experiments  and  other  efforts  to  test  its  criterion  validity  will  also  be 
relevant.  Still,  we  will  argue  that  the  merits  of  individual  studies  must 
continue  to  rest  to  a  significant  degree  on  their  content  validity.  Finally, 
toward  the  end  of  the  paper,  attention  will  turn  from  validity  assessment  of 
individual  studies  to  consider  possible  "rules  of  evidence"  for  judging  whether 
or  not  the  CV  method  itself  is  valid. 

I.  THE  RATING  FORM
The  questions  found  in  Figure  1  have  two  sources.  First,  the
justification  for  interpreting  values  from  CV  studies  as  estimates  of  true 
Hicksian  values  will  be  stronger  the  more  closely  study  design  is  linked  to  the 
theory  of  Hicksian  values.  To  implement  this  principle,  we  would  propose  the 
following  test:  Would  respondents  be  both  able  and  willing  to  answer  the  CV 
question  or  questions  in  ways  consistent  with  their  true  values  if  they  were  the 
utility-maximizing  consumers  of  economic  theory?3  Stated  differently,  content
valid  studies  must  be  designed  to  elicit  true  values  from  homo  economicus.  To 
the  extent  that  they  do  so,  reviewers'  judgements  about  the  content  validity  of 
a  study  will  be  enhanced.  Second,  through  a  combination  of  common  sense,  the 
current  state  of  knowledge  in  survey  research  generally,  modern  econometrics, 


3  Some  confusion  may  be  avoided  by  noting  that  this  is  only  one  of  the
places  where  theory  plays  a  role  in  validity  assessment.  For  example,  theory 
provides  the  hypotheses  that  are  tested  in  construct  validity  assessment. 
and  experience  to  date  in  CV  research,  much  is  known  about  how  to  design  and
execute  CV  studies.  A  content  valid  study  will  employ  state-of-the-art
procedures.   Stated  differently,  content  valid  studies  must  be  designed  to 
interact  effectively  with  homo  sapiens. 

Let  us  suppose  that  some  "intervention"  in  the  economy  affects 
environmental  attributes  relevant  to  potential  respondents.  Such  an 
intervention  could  take  the  form  of  public  projects,  alterations  in 
environmental  regulations,  or  new  policies  that  somehow  affect  the  environment. 
Interventions  may  also  take  the  form  of  accidental  or  intentional  environmental 
insults  such  as  oil  spills  and  emissions  of  air  pollutants.  If  the  impacts  of 
the  intervention  are  viewed  as  beneficial  by  relevant  members  of  society,  we 
will  refer  to  it  as  a  positive  intervention.  Likewise,  negative  interventions 
are  those  that  are  viewed  by  relevant  people  as  detrimental  to  their  welfare. 
The  goal  of  CV  studies  is  to  measure  the  values  that  members  of  appropriately 
defined  populations  of  people  place  on  attaining  positive  interventions  or 
avoiding  negative  ones. 

We  begin  with  a  preliminary  screening  question  that  asks  whether  the  study 
under  review  has  fatal  flaws.  It  only  makes  sense  to  go  into  study  details  if 
there  are  no  fatal  flaws  in  study  procedures.  If  the  study  meets  the  reviewer's 
minimum  standards,  then  the  body  of  the  rating  form  provides  a  pathway  to 
consider  the  merits  of  the  detailed  study  procedures.  In  this  section,  we  shall 
review  each  of  the  questions  in  some  detail. 

(1)   Do  study  procedures  contain  flaws  that  are  so  serious  that  they  would  rule 
out  use  of  the  results  to  achieve  study  goals? 

Flaws  may  creep  into  CV  studies  through  simple  lack  of  foresight  on  the
part  of  study  designers.   Furthermore,  some  flaws  are  knowingly  accepted  as
compromises  required  to  achieve  other  goals.   For  example,  in  some  situations,
a  referendum  format4  or  some  other  mechanism  with  theoretically  strong  incentive


4  A  referendum  format  frames  the  CV  question  in  terms  of  voting  for  or
against  the  intervention  given  that  an  affirmative  vote  will  require  some  sort
of  payment  or  payments.
characteristics  may  be  very  implausible  to  potential  respondents.  One  might
adopt  donation  payment  vehicles5  in  such  situations  despite  their  theoretical 
inferiority.  The  first  step  in  content  validity  assessment  is  to  judge  whether 
imperfections,  introduced  either  inadvertently  or  intentionally,  are 
sufficiently  detrimental  to  make  the  study  content  invalid. 

Inadequate  documentation  of  study  procedures  is  one  type  of  fatal  flaw. 
Because  replication  of  surveys  is  problematical,  the  burden  of  proof  in  survey 
research  is  on  individual  researchers  to  demonstrate  that  they  followed 
procedures  conducive  to  obtaining  valid  results.  To  assess  content  validity, 
a  reviewer  must  have  a  clear  statement  of  the  study  goals,  a  definition  of  the 
true  value  to  be  estimated,  a  description  of  the  intervention  and  its  effects 
on  environmental  amenities,  a  fairly  detailed  summary  of  the  procedures  followed 
throughout  the  study,  and  a  copy  of  the  full  survey  instrument  including  the 
survey  form  and  any  supporting  materials.  Without  a  minimal  set  of  such 
materials,  it  will  be  impossible  to  do  a  full  content  validity  assessment  and 
the  study  would  be  judged  content  invalid.  The  only  possible  exception  would 
be  where  fatal  flaws  are  discovered  from  available  though  incomplete  materials. 

Now  assume  that  documentation  for  the  study  under  review  is  adequate  to 
conduct  a  full  content  validity  assessment.  Examination  of  Figure  1  will  show 
that  a  study  lacking  fatal  flaws  will  be  subjected  to  16  questions  about  the 
detailed  study  procedures,  ranging  from  the  theoretical  analysis  of  the  true 
value  of  the  intervention  (Question  2)  to  the  adequacy  of  reports  and  articles 
from  the  study  (Question  16).  Spreading  100  total  points  over  16  questions
means  that  any  one  item  does  not  receive  a  lot  of  weight.  Before  proceeding 
with  their  assessments,  reviewers  must  be  reasonably  satisfied  that  detailed 
study  procedures  meet  minimum  standards. 

Suppose,  for  example,  that  a  study  employed  telephone  interviews  in  a  way 
that  the  reviewer  judges  to  be  not  at  all  adequate  to  provide  sound  data  on 
respondents'   values.    Such  a  study  would  have  failed  to  meet  minimum 


5  Framing  a  CV  question  around  a  donation  vehicle  means  that  subjects  are
asked  about  amounts  they  would  donate  toward  implementation  of  the  intervention 
if  it  is  positive  or  toward  avoiding  it  if  adverse  consequences  predominate. 
requirements  on  Question  12  in  this  reviewer's  eyes,  yet  perfection  in  all  other
dimensions  would  earn  it  a  very  high  rating  of  90  points.   In  such  a  situation, 
the  reviewer  would  need  to  declare  that  the  study  did  not  meet  minimum  standards 
for  survey  mode  and  thus  is  content  invalid. 

Other  examples  of  flaws  that  many  potential  reviewers  would  consider  fatal 
come  to  mind.  Most  people  involved  in  the  field  today  would  be  highly 
suspicious  of  a  study  that  attempted  to  measure  willingness  to  accept 
compensation,  no  matter  how  competently  it  was  designed  and  executed.  This 
would  be  a  fatal  violation  of  the  principles  embodied  in  Question  8.  Or, 
suppose  that  a  study's  description  of  an  intervention's  effects  were  extremely 
vague.  Reviewers  might  judge  such  a  study  to  be  fatally  flawed  even  if  all 
other  procedures  were  adequate.  A  fatal  violation  of  the  principles  underlying 
Question  4  would  have  occurred. 

In  practice,  there  would  appear  to  be  less  than  complete  consensus  about 
which,  if  any,  flaws  in  detailed  procedures  ought  to  be  considered  so  serious 
as  to  warrant  judging  a  study  to  be  totally  without  content  validity.  As  will 
happen  often  in  this  paper,  reviewers  will  have  to  fall  back  on  personal 
judgement.  Whether  flaws  are  sufficiently  serious  to  be  judged  fatal  must,  for 
the  time  being,  remain  a  matter  for  individual  reviewers  to  decide,  based  on 
their  interpretations  of  their  own  work,  if  any,  and  the  larger  literature. 

Notice  also  that  study  goals  enter  here.  The  possibility  of  fatal  flaws 
can  only  be  evaluated  in  the  context  of  the  goals  that  the  study  in  question  was 
designed  to  achieve.  A  study  that  is  supposed  to  be  a  first  preliminary 
investigation  of  natural  resource  damages,  for  example,  should  probably  not  be 
held  up  to  the  same  standards  as  one  that  is  designed  to  serve  as  a  basis  for 
final  damage  estimates.  A  low-budget  study  designed  to  serve  primarily  as  a 
student  project  might  well  leave  some  loose  ends  unaddressed  that  would  be 
unacceptable  in  a  study  destined  to  inform  policy  makers  directly. 

If  a  study  has  flaws  that  are  judged  to  be  fatal,  then  the  review  ends  and 
it  is  assigned  a  total  score  of  zero.  Otherwise,  the  review  proceeds  to 
consider  the  study  procedures  in  detail.   Potential  flaws  will  almost  certainly 
be  identified  once  the  review  proceeds  into  the  detailed  questions.  Such
potential  flaws,  though  not  fatal  in  themselves,  raise  concerns,  or  to  use  a 
common  expression,  "red  flags."  The  more  such  red  flags  pop  up  during 
evaluation  of  a  study,  the  less  valid  it  will  be  judged  to  be.  Our  weighting 
system  is  designed  to,  in  a  sense,  count  red  flags,  or  rather  the  lack  of  them. 
We  begin  the  search  for  red  flags  with  a  question  motivated  by  theory. 
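
The two-stage logic of the rating form, a fatal-flaw screen followed by a tally of points, can be sketched in Python as follows. This is only an illustrative sketch: the function name is ours, and the question numbers and point caps in the example are hypothetical placeholders, since the weights in Figure 1 are left to the individual researcher or reviewer.

```python
def content_validity_score(has_fatal_flaw, awarded, max_points):
    """Total content validity score under the screen-then-score logic.

    has_fatal_flaw: reviewer's judgement on the preliminary screening question.
    awarded: points the reviewer gave, keyed by question number.
    max_points: maximum points available, keyed by question number.
    """
    # A study judged fatally flawed receives zero and the review ends.
    if has_fatal_flaw:
        return 0
    total = 0
    for question, cap in max_points.items():
        points = awarded.get(question, 0)  # unanswered questions earn nothing
        if not 0 <= points <= cap:
            raise ValueError(f"Question {question}: points must lie in 0..{cap}")
        total += points
    return total


# Hypothetical caps for three questions only, not the weights of Figure 1.
caps = {2: 5, 3: 10, 4: 10}
print(content_validity_score(False, {2: 4, 3: 7, 4: 10}, caps))
print(content_validity_score(True, {2: 5, 3: 10, 4: 10}, caps))
```

Each point withheld corresponds to a red flag, so a higher total indicates fewer concerns raised during the detailed review.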

(2)   Was  the  true  value  clearly  defined? 

The  link  between  theory  and  the  CV  exercise  can  be  strengthened--thus
enhancing  content  validity--by  carefully  defining,  in  theoretical  terms,  what
is  to  be  measured.  The  simplest  model  of  the  consumer's  choice  problem  where 
environmental  quality  matters  will  illustrate: 

max  U(X;Q)  subject  to  PX  <=  Y,
where  X  is  a  vector  of  conventional  goods  and  services  that  can  be  purchased  at 
exogenously  determined  prices  P,  Q  is  an  exogenously  determined  vector  conveying 
the  status  of  environmental  attributes  affecting  consumer  welfare,  Y  is  income, 
and  U(.)  is  a  "well-behaved"  utility  function.  Let  us  assume  for  the  time  being 
that  the  only  effect  of  the  intervention  in  question  is  to  alter  the  status  of 
environmental  attributes,  let  us  say  from  Q'  to  Q".  Here,  Q'  will  be  taken  to
be  the  status  of  environmental  attributes  "without"  the  intervention  and  Q"
their  status  "with"  it.  We  assume  for  the  time  being  that  P  and  Y  are  not
affected  by  the  intervention. 

Theory  tells  us  that  the  maximum  level  of  utility,  arrived  at  by  solving 
the  choice  problem  just  stated,  can  be  expressed  by  an  indirect  utility 
function,  V(P,Q,Y).  Assuming  that  the  Hicksian  compensating  welfare  measure  is
relevant,  the  "true  value  of  the  intervention"  to  this  consumer,  which  we  shall 
symbolize  by  T,  is  defined  by 

V(P,Q',Y)  =  V(P,Q",Y-T).
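
To see how this definition pins down T, consider a deliberately simple example. Assume, purely for illustration, a scalar price and attribute and the functional form V(P,Q,Y) = ln(Y/P) + b ln(Q); neither the form nor the parameter b comes from the study under review. Setting V(P,Q',Y) equal to V(P,Q",Y-T) then gives T = Y[1 - (Q'/Q")^b]. The Python sketch below solves the defining equation numerically for a positive intervention and compares the result with this closed form. All names and numbers are hypothetical.

```python
import math


def indirect_utility(p, q, y, b=0.1):
    """Hypothetical indirect utility: V = ln(Y/P) + b*ln(Q)."""
    return math.log(y / p) + b * math.log(q)


def true_value(p, q_without, q_with, y, v=indirect_utility, tol=1e-6):
    """Solve V(P,Q',Y) = V(P,Q",Y-T) for T by bisection.

    Handles only positive interventions, for which 0 <= T < Y.
    """
    target = v(p, q_without, y)
    if v(p, q_with, y) < target:
        raise ValueError("this sketch only handles positive interventions")
    lo, hi = 0.0, y * (1.0 - 1e-12)  # keep Y - T strictly positive
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        # Utility with the intervention falls as the payment T rises.
        if v(p, q_with, y - mid) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def closed_form_value(q_without, q_with, y, b=0.1):
    """T = Y * [1 - (Q'/Q")**b], valid for the assumed functional form only."""
    return y * (1.0 - (q_without / q_with) ** b)


# Hypothetical numbers: income 40,000 and a doubling of the attribute index.
print(true_value(1.0, 1.0, 2.0, 40000.0), closed_form_value(1.0, 2.0, 40000.0))
```

Even in this toy case T depends on income Y as well as on Q' and Q", which illustrates why the budget constraint matters for the valuation problem.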

Now  suppose  a  CV  study  was  conducted  to  estimate  T  for  each  member  of  a 
sample  drawn  from  some  predetermined  population  of  individuals.  Working  out  T 
in  such  formal  terms  would  help  identify  several  requirements  for  a  fully 
content  valid  study.  First,  for  respondents  to  arrive  at  their  estimates  of  T,
they  would  have  to  be  well  informed  about  how  the  intervention  would  affect  all 
relevant  parameters  of  their  choice  problem.  Thus,  examining  the  content 
validity  of  the  study  in  question  would  in  part  focus  on  efforts  expended  during 
the  study  to  assure  that  real  world  respondents  were  well  informed.  Several 
questions  in  Figure  1  deal  with  various  dimensions  of  information.  In  addition, 
even  the  simple  theory  serves  to  emphasize  that  the  budget  constraint  is 
important  in  determining  T.  Hence,  we  might  ask  whether  actual  respondents  were 
adequately  reminded  of  their  budget  constraints.  Question  (6)  focuses  on  this 
particular  aspect.  In  these  ways,  theory  helps  us  to  frame  the  questions  in 
Figure  1.

Designers  of  CV  studies  should  carefully  consider  the  definition  of  T 
applicable  in  their  particular  case.  Formal  theoretical  modeling  of  the 
valuation  problem  never  hurts.  Writing  out  the  equations  may  seem  mundane,  but 
can  prove  helpful  in  identifying  gaps  and  flaws  in  the  information  and  context 
that  will  ultimately  be  provided  in  the  CV  scenario.6 

Some  studies  will  be  able  to  focus  on  effects  of  the  intervention  on 
environmental  attributes  alone,  as  we  did  in  the  model  just  presented.  Other 
studies  may  have  to  deal  with  effects  on  prices,  incomes,  and  other  parameters 
as  well.  Additional  theoretically  relevant  issues  may  arise.  For  example,  the 
timing  of  both  effects  and  payments  may  affect  true  values.  Where  uncertainty 
of  one  kind  or  another  is  a  potentially  significant  factor  in  the  theoretical 
consumer's  valuation  problem,  this  should  be  explicitly  modeled  and  the  welfare 
measure  sought  (e.g.,  option  price)  defined.  The  results  of  such  exercises 
should  appear  in  any  technical  reports  from  the  study.  Figure  1  allows  up  to 
5  points  to  be  assigned  to  a  study  depending  on  how  well  it  defined  the  true 
value  or  values  it  sought  to  measure. 


(3)   Were  the  environmental  attributes  relevant  to  potential 
subjects  fully  identified? 


6  In CV jargon, the "scenario" is the part of the survey instrument that
communicates to respondents what is to be valued and under what circumstances.

In the abstract, we represented the environmental attributes affecting
consumer welfare by including the vector Q in the direct and indirect utility
functions. However, theory alone offers limited guidance regarding which actual
attributes  are  relevant  to  real  world  study  subjects  and  which  are  not.  From 
the  potentially  large  set  of  attributes  of  the  environment  that  might  be 
relevant  in  theory,  a  subset  must  be  defined  that  human  respondents  believe 
affects  their  welfare. 

Introspection  and  casual  observation  on  the  part  of  the  researchers  help 
to  formulate  working  hypotheses  about  which  attributes  have  value.  For  example, 
it  seems  likely  that  attributes  affecting  human  health  are  important  to  people. 
However,  it  may  be  necessary  to  go  beyond  introspection  and  casual  observation 
to  sort  out  which  attributes  matter. 

The  tools  of  qualitative  research  can  be  used  to  learn  more  about  which 
attributes  are  and  are  not  important  to  potential  respondents.  CV  studies  often 
employ  focus  groups.  Researchers  may  also  observe  one-on-one  interviews  with 
subjects from the pool of potential respondents. Such interviews, and
particularly debriefing sessions with subjects afterwards, can help sort out the
relevant attributes. Verbal protocols (Schkade and Payne 1994) may be analyzed
to further explore how respondents view the attributes. Use of such techniques
enhances content validity.

We  have  allocated  up  to  10  points  for  this  aspect.  How  many  points  to 
assign  to  a  study  will  vary  depending  on  the  particular  circumstances.  Studies 
where  respondent-relevant  attributes  are  rather  simple  and  obvious  may  earn  the 
full  10  points  after  little  or  no  qualitative  research.  On  the  other  hand, 
consider a possible case where the environmental amenities potentially
affected by the intervention are complex and difficult to explain
to potential respondents. Suppose that, even though substantial efforts were
devoted to qualitative research, those efforts do not appear to have adequately
identified which attributes people are concerned about and why. In such a case,
reviewers would assign fewer than 10 points in recognition of the inherent
difficulties of the problem and despite the fact that the researchers did all
they could be expected to do to try to come to grips with the problem.

(4)   Were  the  potential  effects  of  the  intervention  on  environmental  attributes 
and  other  economic  parameters  adequately  documented  and  communicated? 

Once  the  environmental  attributes  relevant  to  potential  study  subjects 
have  been  determined,  the  next  step  in  study  design  is  to  document  how  the 
intervention  will  affect  those  attributes.  This  is  normally  done  by  finding  out 
what  physical  and  biological  scientists  know  (and  do  not  know)  about  how  the 
intervention  will  affect  those  attributes.7  Impacts  on  relevant  non- 
environmental  parameters  such  as  prices  and  incomes  also  need  to  be  documented 
in  cases  where  they  are  likely  to  occur.  The  more  thoroughly  such  effects  were 
investigated and documented, the higher will be the score on this item. Content
validity is also enhanced when scientific uncertainty about effects was
carefully noted and its potential relevance to study subjects investigated as
part of the study design process.

Once the potential effects of the intervention have been documented, choices
have to be made about what to say about them in the CV instrument. Real world
respondents  may  come  to  CV  exercises  with  a  great  deal  of  information  or  no 
knowledge  at  all  about  the  relevant  attributes  of  the  environment.8  How  much 
knowledge  they  had  prior  to  the  survey  must  be  understood  by  the  designers  of 
studies.  On  the  one  hand,  going  over  aspects  of  the  amenity  that  are  already 
common  knowledge  may  needlessly  lengthen  the  survey  process,  possibly  insulting 
potential  respondents  in  the  process.  Response  rates  and  the  quality  of  the 
final  data  may  suffer.  On  the  other  hand,  poorly  informed  respondents  may  not 
value  the  environmental  effects  in  question  but  rather  some  other  set  of 


7  In damage assessments, the "intervention" will already have occurred and
it will be necessary to learn what scientists know and do not know about the
resulting injuries.

8  There is an ongoing debate among environmental economists about whether
the status of an attribute can be "relevant" to consumers who are not aware of
it. For one view that has found its way into print, see Bishop and Welsh
(1992). Basically, that paper argues that, as a practical matter, real world
consumers cannot be expected to have full knowledge about all the things
affecting their welfare. Obscure and even unknown environmental resources could
have value to them.

environmental effects existing only in their imaginations. A poorly informed
respondent is left to guess at the effects of the intervention and arrive at
values based on those guesses. For respondents to have been well informed, the
knowledge that they brought to the CV exercise may have needed to be augmented
by information provided in the scenario.

Investigator introspection and casual observation may again have served
as the starting points, but qualitative research may have been required to
determine the quantity and quality of information that real world subjects were
likely to bring to the valuation exercise. Once pre-existing knowledge had been
assessed, the study scenario should have been designed to provide additional
knowledge as needed. At the end of this process, real world respondents needed
to be well informed about the status of all relevant attributes with and without
the intervention. The NOAA Panel summarized its view this way (Arrow et al.,
1993, p. 4605):

If  CV  surveys  are  to  elicit  useful  information  about  willingness  to  pay, 
respondents  must  understand  exactly  what  it  is  they  are  being  asked  to 
value  (or  vote  upon)  .  .  . 

The evidence available to the reviewer regarding whether subjects fully
understood the extent of the effects being described to them may be somewhat
limited. Careful reading of the scenario is central to the assessment of this
aspect. Debriefing questions regarding what respondents believed the effects
of the intervention to be can be included in the instrument (see, for example,
Carson et al. 1992). In some recent, as yet incomplete studies, we have
presented subjects with complex information and then asked them to complete a
series of true-false questions about the effects of the intervention. Focus
groups have indicated that so long as respondents are told that the purpose of
the true-false questions is to assess how well we communicated with them rather
than to test their knowledge per se, such questions do not generate substantial
respondent resistance. We believe that these questions not only help to assess
how well respondents were able to digest the information, but also encourage
them to go back and re-read material that may have been unclear the first time
through.

In recognition of the importance of this aspect, the Rating Form allows
up to 10 points to be assigned depending on how well the study did in
documenting and communicating the potential effects of the intervention.

(5)  Were  respondents  aware  of  the  existence  and  status  of  environmental 
substitutes? 

Thus far, only the elements of the vector Q that would be affected by the
intervention have been considered. Theory tells us that the value of
environmental amenities affected by the intervention may depend on the status
of other, unaffected amenities that are substitutes for or complements to the
affected ones. Content validity may, therefore, be enhanced by assessing
respondents' knowledge of the existence and status of substitutes and, if
necessary, adding information about them to the scenario. Presumably complements
should also be considered, but there is less emphasis on them in the thinking
of many scholars, including members of the NOAA Panel. The Rating Form
recommends that up to 5 points be awarded, depending on the reviewer's judgement
about whether subjects were well informed about substitutes.

(6)  Was  the  budget  constraint  adequately  stressed? 

Since true values are defined in a framework that involves
budget-constrained utility maximization, many, including the NOAA Panel, argue that
study  subjects  ought  to  be  explicitly  reminded  of  their  budget  constraints. 
Failure  to  do  so  would  reduce  the  content  validity  of  a  study  in  the  eyes  of 
many  potential  reviewers. 

We  include  this  aspect  for  completeness,  though  we  doubt  that  the  alleged 
need  to  stress  budget  constraints  will  stand  up  to  empirical  scrutiny.  Too  many 
members  of  focus  groups  we  have  observed  spontaneously  mention  their  limited 
resources. At least one published study (Loomis, Gonzalez-Caban, and Gregory
1994) has found statistically indistinguishable results whether budget
constraints were mentioned or not.9 Thus, while we do assign 5 points to this
dimension, we admit that a wide range of opinions exists about its importance.
Personally,  we  would  not  require  a  lot  of  evidence  that  respondents  were  aware 
of  their  budget  constraints  before  assigning  the  full  5  points.  Other  reviewers 
might  well  be  more  difficult  to  satisfy  in  this  regard. 

(7)   Was  the  context  for  valuation  fully  specified  and  incentive  compatible? 

In  addition  to  providing  respondents  with  needed  information  about  the 
effects  of  the  intervention,  a  CV  scenario  will  normally  provide  them  with  what 
we  shall  term  the  "context  for  valuation."  By  this,  we  mean  all  dimensions  of 
the proposed transaction dealing in one way or another with how money would be
transferred. Whether the money will be paid to or received by the respondents
needs to have been clearly spelled out. Points might be lost, for example, if
the nature of the value to be expressed was vague (e.g., asking "What is it
worth to you?"). Whether the value was to be that of the individual or of the
household  needs  to  be  clearly  stated.  Who  else  will  be  paying  or  receiving 
payment  (the  so-called  extent  of  the  market,  see  Smith  1993)  may  matter  for 
environmental  amenities  with  public  goods  characteristics.  Certainly,  theory 
dictates  that  the  timing  of  payments  has  relevance  to  valuation.  A  valid  CV 
study will strive to be sure that the context of valuation is as complete as
possible.

Furthermore, theory raises some rather stern warnings about the incentive
properties of CV scenarios. Incentive compatibility of payment mechanisms is
an issue even for amenities such as recreational opportunities with private
goods characteristics. It is well known, for example, that first-price sealed-bid
auctions create an incentive to bid less than one's maximum willingness to pay,
whereas a Vickrey (second-price) auction should lead to full value revelation,
all else equal. This


9  Loomis et al. also tested whether mentioning substitutes mattered and
found no effect. We are less ready to generalize from that result, however,
because both the level of knowledge and the importance of substitutes are likely
to vary greatly from study to study.

theoretical result may have practical relevance to studies using an open-ended
CV  format.  Where  environmental  amenities  take  on  public  goods  characteristics, 
incentive  issues  are  magnified  because  of  the  possibility  of  free  riding  and 
strategic  responses.  The  theoretical  strengths  of  the  referendum  format  in  this 
context  are  widely  accepted  (e.g.,  Mitchell  and  Carson  1989  and  Hoehn  and 
Randall  1987)  and  led  the  NOAA  Panel  to  advocate  heavy  reliance  on  referenda  in 
CV  studies  for  purposes  of  damage  assessment.  In  such  circumstances,  use  of 
referendum  formats,  as  opposed  to  voluntary  donations,  for  example,  would 
enhance content validity in the eyes of many reviewers. In our weighting
scheme, a study whose context for valuation is complete and fully incentive
compatible would be awarded 5 points. Fewer points would be assigned to studies with
scenarios  that  are  incentive  incompatible  in  recognition  of  the  potential 
confusion  or  strategic  responses  that  such  scenarios  could  induce. 
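
The auction result mentioned above can be checked with a small enumeration. This is a sketch under simplifying assumptions (a single rival bid, ties lost, bids on a coarse grid, and an arbitrary value of 100); it illustrates the textbook result and is not part of any study discussed here.

```python
def payoff(value, bid, rival_bid, rule):
    """Surplus to one bidder facing a single rival bid.

    rule is "first" (winner pays own bid) or "second" (winner pays the
    rival's bid, as in a Vickrey auction). Ties are lost (an assumption).
    """
    if bid <= rival_bid:
        return 0.0
    price = bid if rule == "first" else rival_bid
    return value - price

VALUE = 100.0                                # hypothetical maximum willingness to pay
GRID = [float(b) for b in range(0, 201, 5)]  # candidate own and rival bids

def truthful_is_weakly_dominant(rule):
    # True if bidding one's full value is never worse than any other bid
    # on the grid, whatever the rival bids.
    return all(
        payoff(VALUE, VALUE, rival, rule) >= payoff(VALUE, bid, rival, rule)
        for rival in GRID
        for bid in GRID
    )

print("second-price:", truthful_is_weakly_dominant("second"))
print("first-price: ", truthful_is_weakly_dominant("first"))
```

Under the first-price rule a truthful winner earns no surplus, so shading the bid pays; under the second-price rule the price does not depend on the winner's own bid, which removes that incentive.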

(8)  Did  the  CV  scenario  elicit  willingness  to  pay? 

That utilizing a willingness-to-accept valuation context would be
considered a fatal flaw by many reviewers has already been noted. We include
this item for the special cases where circumstances of some sort might lead a
reviewer to have fewer reservations about a willingness-to-accept question.
The clear, crisp willingness-to-pay question will earn an easy 5 points, while
fewer points could be allocated when willingness to accept was elicited in some
nonfatal way.

(9)  Did  survey  participants  accept  the  scenario? 

CV researchers and others (e.g., the NOAA Panel) have come to recognize
that it is important not only that the scenario communicate effectively, but
also that respondents accept it. A study subject accepts the scenario when he or she
implicitly agrees to proceed with the valuation exercise based on the
information and context provided. Scenario rejection can lead either to poor
quality valuation data or item non-response on the CV question. The NOAA Panel
expressed it this way (Arrow et al., 1993, p. 4605):

. . . even when CV surveys provide detailed and accurate information about
the  effects  of  the  program  being  valued,  respondents  must  accept  that 
information  in  making  their  (hypothetical)  choices.  If,  instead, 
respondents  rely  on  a  set  of  heuristics  ("these  environmental  accidents 
are  seldom  as  bad  as  we're  led  to  believe,"  or  "authorities  almost  always 
put  too  good  a  face  on  these  things"),  in  effect  they  will  be  answering 
a  different  question  from  that  being  asked;  thus,  the  resulting  values 
that  are  elicited  will  not  reliably  measure  willingness  to  pay. 

Whether respondents accepted the scenario may be difficult to determine,
but some evidence can be mustered to help. The acceptability of the scenario
can be intentionally evaluated during focus groups and other procedures followed
during the qualitative phase of the research. Debriefing questions may be
included in the survey to help determine whether respondents accepted the
scenario when answering CV questions.


(10)  Did survey respondents believe the scenario?

Those  writing  on  CV  often  emphasize  that  it  involves  "hypothetical" 
valuation.  In  many  settings,  asking  study  subjects  to  play  "what  if"  games  in 
order  to  value  the  intervention  is  unavoidable  because  a  fully  believable 
scenario  is  impossible  to  construct.  However,  in  some  circumstances,  it  may  be 
possible  to  construct  a  scenario  with  a  high  degree  of  plausibility.  A  couple 
of  illustrations  can  be  presented. 

Carson et al. (1992), in their well-known study to value the effects of the
Exxon Valdez oil spill, told respondents that double-hulled tankers would be
required  for  hauling  oil  in  coastal  waters  of  the  U.S.,  but  that  it  would  take 
a  decade  to  convert  the  fleet.  In  the  meantime,  they  noted,  the  federal 
government  was  considering  an  escort  ship  program  for  Prince  William  Sound.  The 
escort  ships  would  be  capable  of  dealing  quickly  with  any  accident  that  might 
otherwise  lead  to  a  major  oil  spill.  Without  the  escort  ships,  a  major  oil 
spill  could  be  expected  in  Prince  William  Sound  within  the  next  ten  years.  The 
escort  ship  program,  if  activated,  would  be  financed  in  part  by  a  one-time 
increase  in  federal  income  taxes.  Subjects  were  told  that  the  government  would 
decide  whether  or  not  to  go  forward  with  the  program  based  on  results  of  the 
survey.  Focus  groups  and  debriefing  questions  indicated  that  large  numbers  of 
respondents believed the scenario.

Work in progress is focusing on possible modifications in how Glen Canyon
Dam  on  the  Colorado  River  is  operated  in  order  to  protect  and  enhance  resources 
downstream  in  the  Grand  Canyon.  If  enacted,  modifying  operations  of  the  dam 
will  reduce  its  ability  to  generate  power  on-peak.  A  very  likely  result  will 
be  increases  in  how  much  many  households  in  several  western  states  will  pay  for 
power.  One  sampling  frame  for  the  CV  study  on  this  problem  is  the  potentially 
affected  power  consumers.  A  referendum  format  is  being  used  and  the  payment 
vehicle  for  this  sampling  frame  will  be  power  rates.  Focus  groups  showed  that 
such  subjects  find  it  very  plausible  that  they  will  really  have  to  pay  if  dam 
operations  are  modified. 

That  subjects  believe  that  their  responses  will  affect  what  they  really 
pay  enhances  the  credibility  of  their  responses,  particularly  if  the  context  for 
valuation  is  incentive  compatible.  Accordingly,  to  the  extent  that  the  scenario 
is  believable,  content  validity  is  enhanced.  Figure  1  suggests  that  reviewers 
assign  up  to  5  points  depending  on  their  evaluation  of  how  believable  the 
scenario  was  and  any  evidence  provided  regarding  whether  the  respondents 
believed  the  scenario. 

(11)   How  adequate  and  complete  were  survey  questions  other  than  those  designed 
to  elicit  values? 

CV  surveys  typically  include  many  questions  other  than  those  intended  to 
elicit  values.  Several  different  objectives  may  be  involved.  For  one,  CV 
researchers often find it desirable to investigate respondents' motives for
answering  the  CV  questions  as  they  did. 

The  exact  form  of  such  questions  depends  on  both  the  form  of  the  CV 
question  and  the  researcher's  judgement.  For  example,  open-ended  questions 
elicit responses of zero with substantial frequency. A response of zero may
signify that the respondent has a zero value for the intervention, but it could
also have been intended to communicate that the respondent did not know his or
her value, refused to place a value on the amenities, rejected the scenario, or
hoped that the response would reduce fees actually paid. Thus, many
researchers would recommend that a study employing an open-ended question
include additional questions to help interpret zeroes.

The  NOAA  Panel,  which  recommended  that  a  referendum  format  be  used,  also 
recommended  that  voting  be  followed  by  a  question  in  an  open-ended  format  asking 
respondents to explain in their own words why they voted as they did. Other
researchers might judge that closed-ended questions or some combination of
closed-ended and open-ended questions should be used, but the basic objective of
identifying problematical responses to the CV question would be the same.
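
The screening of zero responses described above can be sketched as a simple classification. The reason codes below are hypothetical labels for follow-up answers, invented for this illustration; an actual study would define its own coding scheme.

```python
from collections import Counter

# Hypothetical follow-up reason codes (assumptions, not from any study).
TRUE_ZERO_REASONS = {"no_value"}
PROBLEMATIC_REASONS = {
    "does_not_know_value",
    "refuses_to_value",
    "rejects_scenario",
    "strategic",  # e.g., hoping to reduce fees actually paid
}

def classify(bid, reason=None):
    """Label one open-ended CV response using its follow-up answer."""
    if bid > 0:
        return "positive"
    if reason in TRUE_ZERO_REASONS:
        return "true_zero"
    if reason in PROBLEMATIC_REASONS:
        return "problematic_zero"
    return "unexplained_zero"

# Made-up responses, for illustration only.
sample = [
    (25.0, None),
    (0.0, "no_value"),
    (0.0, "rejects_scenario"),
    (0.0, "strategic"),
    (0.0, None),
]
summary = Counter(classify(bid, reason) for bid, reason in sample)
print(dict(summary))
```

How problematic zeroes are then handled (dropped, recoded, or analyzed separately) is a judgement for the researcher; the sketch only separates the cases.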

Additional  questions  may  have  been  included  in  the  survey  to  help  support 
its  content  validity.  For  example,  questions  could  be  included  to  help  evaluate 
whether  respondents  understood  descriptive  material  in  the  scenario  relating  to 
attributes  and  the  context  of  valuation.  Many  past  studies  have  attempted  to 
use  follow-up  questions  to  identify  strategic  responses. 

Additional questions may also be included to support construct validity
assessment. Construct validity testing will normally involve the estimation of
"valuation equations" where relationships between answers to CV questions and
other variables are considered either in cross tabulations or in regression
equations with several independent variables (Bishop et al., 1994). Many types
of  questions  can  be  included  in  the  survey  to  support  such  analyses.  For 
example,  the  NOAA  Panel  recommended  that  cross-tabulations  with  valuation 
responses  include  income,  knowledge  of  the  site,  prior  interest  in  the  site  for 
visitation  or  other  reasons,  environmental  attitudes,  attitudes  toward  big 
business,  distance  of  residence  from  the  site,  understanding  of  the  valuation 
task, acceptance of the scenario, and willingness and/or ability to perform the
task.
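
As an illustration of the kind of cross-tabulation described above, the following sketch tabulates referendum votes against an income bracket. The data are invented for the example, and the two-bracket split is an assumption made only to keep it short.

```python
from collections import Counter

# Made-up (income bracket, vote) pairs, for illustration only.
responses = [
    ("low", "no"), ("low", "no"), ("low", "yes"),
    ("high", "yes"), ("high", "yes"), ("high", "no"),
]

counts = Counter(responses)

def yes_share(bracket):
    """Share of 'yes' votes within one income bracket."""
    yes = counts[(bracket, "yes")]
    total = yes + counts[(bracket, "no")]
    return yes / total if total else float("nan")

for bracket in ("low", "high"):
    print(bracket, counts[(bracket, "yes")], counts[(bracket, "no")],
          round(yes_share(bracket), 2))
```

A pattern in which the share voting yes rises with income would be consistent with economic theory and so would lend some support to construct validity; the same tabulation can be repeated for the other variables listed above.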

Such  survey  questions  need  to  be  scrutinized  as  part  of  content  validity 
assessment.  Only  if  they  are  well  designed  will  responses  provide  supporting 
data  for  analyzing  responses  to  the  CV  question.  The  Rating  Form  assigns  5 
points  to  this  dimension. 

(12)   Was  the  survey  mode  appropriate? 

Mail surveys are attractive to CV researchers because they are the least
expensive of the major modes. There also may be methodological reasons for
choosing  a  mail  approach.  Mail  is  preferred  by  some  researchers  because  mail 
instruments  give  them  complete  control  over  the  information  and  context 
communicated  to  potential  respondents.  Other  researchers  shy  away  from  mail 
surveys  because  of  limited  reading  skills  of  potential  respondents  from  the 
general  population,  even  in  the  US  and  other  countries  where  literacy  rates  are 
relatively  high.  Furthermore,  even  the  more  literate  respondents  may  be 
reluctant  to  try  to  read  and  digest  large  amounts  of  written  material  about  the 
intervention  and  its  consequences. 

Telephone  interviews  are  more  expensive  than  mail  surveys  and  are  limited 
by  the  amount  of  information  and  context  that  can  be  communicated  in  a  brief 
phone  call.  Effective  communication  may  require  presenting  respondents  with 
visual  aids  such  as  charts,  graphs,  and  photographs.  This  will  not  be  feasible 
in a survey conducted by phone. On the other hand, it is somewhat easier to get
a reasonably high response rate by phone than by mail, and reading skills are not
involved.

Personal  interviews  can  make  communication  easier  because  of  the  personal 
contact  they  provide.  More  information  can  normally  be  provided  than  would  be 
possible  by  mail  or  over  the  phone.  Conducting  surveys  in  person  may  increase 
response rates. However, in-person surveys with high response rates are very
expensive.

From  the  perspective  of  content  validity  assessment,  survey  mode  must  be 
appropriate  for  the  study  goals  and  the  complexity  of  the  information  and 
context  that  need  to  be  communicated.  If  the  goal  is  to  value  a  recreational 
experience  that  is  quite  familiar  to  respondents,  for  example,  then  a  mail 
survey  may  be  quite  adequate.  If  the  goal  is  to  estimate  the  non-use  values  of 
injuries  associated  with  a  release  of  oil  or  a  toxic  material  that  had  complex 
environmental  impacts,  then,  as  the  NOAA  Panel  recommended,  personal  interviews 
would  appear  to  have  a  large  advantage.  Using  a  mail  or  telephone  survey  in 
such  a  situation  would  be  grounds  for  questioning  the  content  validity  of  a 
study. This is not to say that a mail or telephone survey would necessarily be
ruled out, but in the eyes of many CV researchers an extra burden of proof would
rest  on  the  study  team  to  provide  evidence  that  the  mail  or  telephone  procedures 
worked  well. 

(13)   Were  qualitative  research  procedures,  pretests,  and  pilots  sufficient  to 
find  and  remedy  identifiable  flaws  in  the  instrument  and  associated  materials? 

Once  survey  designers  have  roughed  out  an  instrument  and  related  documents 
based  on  their  perceptions  of  how  respondents  will  react,  qualitative  research 
is  often  needed  to  refine  the  instrument.10  For  example,  focus  groups  may  be 
asked  to  complete  a  mail  survey  and  then  to  discuss  it  with  the  group  leader. 
Or,  an  instrument  designed  for  personal  interviews  can  be  tested  in  observed 
personal  interviews.  During  the  interview  and  afterwards  in  debriefing  sessions 
with  the  subjects,  researchers  can  try  to  identify  ways  that  the  instrument  is 
being  misinterpreted  or  that  information  provided  is  incomplete  or  otherwise 
inadequate.  Possible  improvements  can  be  tested  out  as  well.  Qualitative 
testing  should  not  only  involve  verbal  materials  but  also  any  photographs  or 
other  visual  aids. 

Though  we  share  the  now  commonly  accepted  view  that  qualitative  research 
can  be  invaluable  in  the  design  of  CV  surveys,  its  limitations  in  establishing 
validity  must  also  be  recognized.  The  typical  report  will  include  only  a  terse 
statement  such  as,  "Four  focus  groups  were  conducted."  Little  or  nothing  is 
said  about  the  extent  to  which  the  focus  groups  succeeded  in  working  the  "bugs" 
out  of  the  instrument  and  associated  documents.  Standard  procedures  for 
applying  qualitative  research  tools  and  reporting  the  results  do  not  exist,  or 
at  least  have  not  found  their  way  into  everyday  practice.  This  may  be  a 
fruitful  area  for  research.  In  the  meantime,  reviewers  of  CV  studies  may  have 
to  take  the  "quality"  of  qualitative  work  more  or  less  at  face  value.  An 
exception  may  be  litigation,  where  details  about  procedures  and  results  can  be 
ferreted  out  from  audio  and  video  records  and  written  reports  entered  into 
evidence  and  from  depositions  and  cross  examination. 


10  Circulating the instrument to knowledgeable colleagues for review may
also be helpful.

Formal pretesting and piloting11 of a nearly finished instrument may also
help  to  improve  it.  Analysis  of  responses  can  help  identify  problems. 
Interviewers  often  help  to  identify  places  where  in-person  and  telephone 
questionnaires  can  be  improved.  Interviewers  can  also  be  instructed  to  record 
verbatim  any  remarks  about  the  survey  questions  and  information  presented. 
Though less effective, subjects in mail pretests can be asked to write comments
in the margins. A subsample can be contacted by telephone to probe for flaws
in  a  proposed  mail  instrument.  Through  such  procedures,  the  study  design  can 
be  tested  out  under  field  conditions,  enhancing  content  validity  in  the  process. 

CV has been applied in such diverse settings that generalizations are not
possible  about  how  much  qualitative  research,  pretesting,  and  piloting  are 
needed  in  any  particular  case.  At  one  extreme  are  studies  of  widely  researched, 
relatively  straightforward  issues,  where  instruments  may  require  relatively 
little  preliminary  testing.  A  recreational  fishing  study  would  be  one  example. 
There,  researchers  could  count  on  current  users,  at  least,  to  have  a  high  degree 
of  knowledge  about  relevant  resources.  They  could  draw  on  the  wealth  of 
knowledge  on  fishery  valuation  found  in  many  past  studies.  At  the  other  extreme 
would  be  non-use  studies  involving  resources  that  may  be  unfamiliar  to  large 
numbers  of  respondents.  Hence,  judgements  about  how  much  preliminary  work 
should  have  been  performed  must  take  the  specific  circumstances  into  account. 
The more complex the change in environmental amenities to be evaluated and
the less familiar the subjects are likely to be with the amenities, the more
weight should be placed on whether or not qualitative research, pretests, and
pilots  were  conducted.  Up  to  5  points  are  to  be  assigned  to  this  aspect  under 
our  version  of  the  Rating  Form. 


11 Pretests are distinguished from pilots by their smaller and more
convenient  samples.  The  goal  of  pretests  is  to  identify  major  problems  with  the 
instrument,  associated  documents,  and  procedures  that  will  become  apparent  even 
for  small  samples.  Question  wording  that  will  confuse  large  numbers  of 
respondents  or  lead  to  large  item  non-response  may  become  apparent,  for  example. 
Pilot  studies  allow  for  more  fine  tuning  of  question  and  information  wording  and 
for preliminary investigations of the likely statistical properties of final
results.

(14) Given study objectives, how adequate were procedures employed to choose
study  subjects,  assign  them  to  treatments  (if  applicable),  and  encourage  high 
response  rates? 

Adequate  population  definition,  sampling,  and  survey  procedures  depend 
upon  study  objectives.  To  allow  for  this  fact,  we  will  distinguish  between  two 
different  kinds  of  studies.  Some  studies  involve  exclusively  methodological 
goals.  One  might,  for  example,  design  a  study  to  compare  the  results  of  open- 
ended  CV  questions  with  those  from  a  bidding  game  for  the  same  amenity.  Other 
studies  have  as  a  major  goal  the  estimation  of  values  for  a  population  of 
individuals,  either  in  the  context  of  policy  analysis  or  litigation.  For 
convenience,  we  will  term  the  former  "methodological  studies"  and  the  latter 
"applied  studies."  Applied  studies  may  also  have  methodological  goals.  Their 
distinguishing  feature  is  that  they  ultimately  hope  to  generalize  results  from 
a  sample  or  samples  to  the  population. 

For  methodological  studies,  procedures  for  choosing  subjects  and 
allocating  them  among  treatments  are  mostly  a  matter  of  common  sense.  Where  one 
is  trying  out  new  CV  procedures  or  testing  some  hypothesis  about  CV  results,  one 
would  hope  to  eventually  conclude  something  about  how  CV  would  perform  in 
applied  studies  under  normal  circumstances.  Hence,  one  might  not  want  to  choose 
kindergartners  as  subjects.  Content  validity  might  suffer  a  bit  if  only 
undergraduates  were  used  as  subjects  since  they  may  not  be  typical  of  subjects 
in  normal  CV  studies.  However,  at  the  other  extreme,  fastidious  sampling  from 
the  general  population  or  some  other  group  would  normally  not  be  required  for 
methodological  studies.  The  sample  selection  bias  inherent,  for  example,  in 
obtaining  subjects  from  the  general  population  willing  to  come  to  a  laboratory 
and  participate  in  an  experiment  would  probably  not  be  a  large  red  flag  in  most 
researchers'  judgement.  In  studies  involving  multiple  treatments,  assignments 
to  cells  should,  of  course,  be  random.  In  field  (as  opposed  to  laboratory) 
studies,  follow-up  procedures  to  increase  response  rates  could  normally  be  less 
rigorous than in an applied study. In sum, the validity of implementation steps
for methodological studies focuses mainly on the reasonableness of the procedures
in light of the study goals.12
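
The random assignment of subjects to cells mentioned above can be sketched in a few lines. This is only an illustration under our own assumptions: the function name, the subject identifiers, the seed, and the two treatment labels (borrowed from the open-ended versus bidding game example) are hypothetical and do not describe the procedure of any study reviewed here.

```python
import random


def assign_to_cells(subject_ids, treatments, seed=None):
    """Randomly allocate subjects to treatment cells of near-equal size."""
    rng = random.Random(seed)      # seeded only so the draw can be reproduced
    shuffled = list(subject_ids)
    rng.shuffle(shuffled)
    cells = {treatment: [] for treatment in treatments}
    # Deal the shuffled subjects out in turn so that cell sizes
    # differ by at most one.
    for position, subject in enumerate(shuffled):
        cells[treatments[position % len(treatments)]].append(subject)
    return cells


subjects = list(range(1, 11))                    # ten hypothetical subjects
treatments = ["open_ended", "bidding_game"]      # hypothetical cell labels
cells = assign_to_cells(subjects, treatments, seed=1995)
for name, members in cells.items():
    print(name, sorted(members))
```

Shuffling first and then dealing subjects out keeps the cells balanced while leaving the composition of each cell to chance.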

Applied  studies,  on  the  other  hand,  must  satisfy  much  more  rigorous 
standards  so  far  as  sampling  and  response  rates  are  concerned.  Either  random 
or  stratified  random  samples  are  required  which  will  support  extrapolation  of 
value  estimates  from  the  sample  to  the  population.  Furthermore,  potential  non- 
response  bias  must  be  addressed  either  directly  by  gaining  a  high  response  rate 
or indirectly. Indirect approaches include attributing zero values to
non-respondents and various methods to assess the extent of non-response bias.
An  example  of  the  latter  would  be  to  compare  reported  socioeconomic 
characteristics of respondents with published statistics for their Census
tracts.
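
As a rough illustration of the first indirect approach, the sketch below computes a conservative sample mean by attributing zero values to non-respondents. The function name, the reported amounts, and the sample size are hypothetical and are not drawn from any study discussed in this paper.

```python
def conservative_mean_wtp(responses, n_sampled):
    """Mean willingness to pay when every non-respondent is assigned zero.

    responses -- WTP amounts reported by respondents (hypothetical data)
    n_sampled -- total number of individuals drawn into the sample
    """
    if n_sampled <= 0 or n_sampled < len(responses):
        raise ValueError("n_sampled must be positive and at least len(responses)")
    # Non-respondents add nothing to the numerator but still count in the
    # denominator, so the respondent mean is scaled by the response rate.
    return sum(responses) / n_sampled


reported = [10.0, 25.0, 0.0, 40.0, 15.0]   # five respondents (illustrative)
sampled = 8                                 # eight individuals were contacted

respondent_mean = sum(reported) / len(reported)
lower_bound = conservative_mean_wtp(reported, sampled)
print(respondent_mean, lower_bound)
```

The gap between the two figures indicates how sensitive an extrapolated value would be to the treatment of non-respondents.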

Up  to  5  points  can  be  allocated  to  a  study  depending  on  how  well  it  dealt 
with  sampling,  non-response,  and  related  details  within  the  context  of  its 
overall  objectives. 

(15)  Was  the  econometric  analysis  adequate? 

Once  the  responses  are  in,  high  content  validity  requires  that  the  data 
be  competently  coded  and  entered  into  computer  files  for  analysis.  Success  here 
again  is  simply  a  matter  of  using  common  sense.  For  example,  verification  of 
the  data  is  often  facilitated  by  entering  it  twice  and  reconciling  the  files. 
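
The double-entry check just described can be sketched as follows. The record layout (a record identifier mapped to coded fields), the field names, and the function name are our own assumptions made for illustration only.

```python
def reconcile(first_pass, second_pass):
    """Return (record_id, field, first_value, second_value) for every mismatch.

    Each argument maps a record identifier to a dict of coded fields.
    """
    discrepancies = []
    for record_id in sorted(set(first_pass) | set(second_pass)):
        a = first_pass.get(record_id, {})
        b = second_pass.get(record_id, {})
        # A field missing from one pass shows up as None on that side.
        for field in sorted(set(a) | set(b)):
            if a.get(field) != b.get(field):
                discrepancies.append((record_id, field, a.get(field), b.get(field)))
    return discrepancies


# Two independent keyings of the same two questionnaires (hypothetical).
entry_1 = {101: {"wtp": 25, "income": 4}, 102: {"wtp": 10, "income": 2}}
entry_2 = {101: {"wtp": 25, "income": 4}, 102: {"wtp": 100, "income": 2}}

mismatches = reconcile(entry_1, entry_2)
for record_id, field, first, second in mismatches:
    print(f"record {record_id}, field {field!r}: {first!r} vs {second!r}")
```

Each flagged item would then be resolved by going back to the original questionnaire.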

The  analysis  itself  should  employ  econometric  procedures  that  are 
appropriate  for  the  inferences  that  are  drawn.  Economists  are  normally  well 
trained  in  this  area.  Assessing  this  aspect  of  content  validity  is  simply  a 
matter  of  verifying  that  analysts  have  employed  their  tools  properly. 

12 The reader will no doubt have noted that this relaxed attitude toward
methodological studies does not carry over to the theoretical and empirical
aspects of study design discussed earlier in our paper. In fact, one might
argue that the requirements for design (as opposed to implementation) of
methodological studies should be even more rigorous than for applied studies.

(16) How adequate are the written materials from the study?

The final step in study execution involves reporting of study design and
execution procedures and study results. Needs here will vary depending on study
goals and the expected audience for the report. A journal article might stress
technical and methodological details, while a report for policy makers might
stress final results and policy implications. As previously noted, content
validity assessment requires rather complete reporting. Studies that do not
provide thorough and complete reports cannot be presumed to have high content
validity. This no doubt was part of the motivation for the NOAA Panel's rather
severe requirements for report results:

Every  report  of  a  CV  study  should  make  clear  the  definition  of  the 
population  sampled,  the  sampling  frame  used,  the  sample  size,  the  overall 
sample  non-response  rate  and  its  components  (e.g.,  refusals),  and  item 
non-response  on  all  important  questions.  The  report  should  also  reproduce 
the  exact  wording  and  sequence  of  the  questionnaire  and  of  other 
communications  to  respondents  (e.g.,  advance  letters).  All  data  from  the 
study  should  be  archived  and  made  available  to  interested  parties  .  .  . 

In fact, we have already noted that inadequate reporting could be a fatal flaw
if it is serious enough.

(17)  Are  there  other  concerns  relating  to  the  design  and  execution  of  the  study 
that  have  not  already  been  addressed? 

At  this  point,  we  confront  two  problems.  First,  CV  study  procedures  still 
involve many dimensions about which widely respected researchers disagree.
Second,  the  circumstances  under  which  studies  are  conducted  may  dictate  some  of 
the  design  issues  that  arise.  For  example,  timing  of  survey  administration  may 
be  an  issue  in  some  circumstances  but  not  in  others.  As  a  case  in  point, 
suppose  injuries  due  to  a  large  oil  spill  are  to  be  valued.  Doing  a  CV  study 
too  soon  afterward  might  be  challenged  on  the  grounds  that  respondents  are  still 
in  a  state  of  shock  and  outrage,  and  may  get  carried  away  by  the  emotions  of  the 
moment  in  answering  CV  questions.  Resulting  value  estimates  would  be  of 
questionable  validity  because  they  might  not  be  robust  over  time. 

Question  17  is  designed  to  allow  reviewers  to  assign  up  to  10  points  based 
on  their  judgments  about  issues  not  raised  elsewhere  in  the  Rating  Form, 
including  those  that  were  more  or  less  unique  to  the  particular  study. 

Assigning  Points 

As we have already pointed out, reviewers may disagree about the number
of points we have allocated to each question. If so, they should adapt the form
accordingly. The relative importance of the issues raised in the form is
likely to be debatable for a long time to come. We would encourage users of the
form  to  be  consistent  in  how  many  points  they  allocate  to  each  question  across 
the  studies  they  review.  This  could  provide  informative  comparisons  as  the  same 
reviewer  applies  the  same  form  to  different  studies.  It  would  be  possible  to 
tell not only which studies rate higher in the reviewer's estimation, but why.
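
A minimal sketch of how a reviewer might tally and compare scores is given below. It uses only the point maxima stated explicitly in this paper for Questions 2, 3, 14, and 17, along with the Question 1 rule that a fatal flaw means zero total points; the other questions are left out rather than guessed. The function name and the two sets of reviewer scores are hypothetical.

```python
# Maximum points for the Rating Form questions whose allocations are stated
# in the text; the remaining questions are omitted here rather than guessed.
MAX_POINTS = {2: 5, 3: 10, 14: 5, 17: 10}


def total_score(awarded, fatal_flaw=False, max_points=MAX_POINTS):
    """Sum awarded points, enforcing each question's maximum.

    A fatal flaw (Question 1) yields zero total points.
    """
    if fatal_flaw:
        return 0
    total = 0
    for question, points in awarded.items():
        limit = max_points[question]
        if not 0 <= points <= limit:
            raise ValueError(f"Question {question}: {points} is outside 0..{limit}")
        total += points
    return total


study_a = {2: 5, 3: 7, 14: 4, 17: 6}   # hypothetical reviewer scores
study_b = {2: 4, 3: 9, 14: 3, 17: 8}

score_a = total_score(study_a)
score_b = total_score(study_b)
# Question-by-question differences show why one study rates higher,
# not merely that it does.
differences = {q: study_b[q] - study_a[q] for q in MAX_POINTS}
print(score_a, score_b, differences)
```

A reviewer who adapts the allocations would change only the table of maxima and then apply the same table to every study reviewed.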

On  any  particular  item  in  the  form,  some  studies  may  easily  receive  full 
credit  simply  because  an  issue  did  not  arise  in  that  particular  case.  Other 
studies  may  lose  points  for  having  neglected  to  one  degree  or  another  the  issue 
or  issues  highlighted  in  the  question.  Under  particularly  difficult 
circumstances,  a  study  may  receive  a  low  score  despite  competent  efforts  to 
overcome  the  difficulty  in  question.  This  would  simply  reflect  the  fact  that 
CV  may  be  applied  in  very  different  settings.  It  should  be  more  difficult  to 
establish the content validity of CV studies in some situations than in others.

Thus far, this paper has had a rather narrow focus. It has limited itself
to content validity and to validity assessment at the level of the individual
study.  An  important  and  perhaps  somewhat  subtle  point  is  present  here:  A  study 
must  first  establish  its  own  validity  before  it  can  be  the  basis  for  drawing 
conclusions  about  the  overall  validity  of  the  CV  method.  Assessing  the  validity 
of  individual  studies  is  partly  a  matter  of  content  validity,  but  it  also 
involves  construct  and  criterion  validity.  Furthermore,  it  would  be  a  mistake 
to  draw  strict  boundaries  between  the  triad  of  approaches  to  validity  testing. 
Rather,  they  are  intricately  linked.  Our  overview  of  content  validity  would  be 
incomplete  if  we  did  not  consider  the  close  linkages  and  complementarities 
between  content  validity,  on  the  one  hand,  and  construct  and  criterion  validity, 
on  the  other.  Then,  the  "rules  of  evidence,"  so  to  speak,  for  considering  the 
overall  validity  of  the  CV  method  will  be  considered  briefly,  with  particular 
emphasis  on  the  role  of  content  validity. 


II.   RELATIONSHIPS  BETWEEN  CONSTRUCT,  CRITERION,  AND  CONTENT  VALIDITY 

While  content  validity  assessment  focuses  on  study  design  and  execution, 
construct  validity  assessment  involves  statistical  testing  of  hypotheses  about 
relationships  between  responses  to  CV  questions  and  other  measures  as  predicted 
by  theory.  Mitchell  and  Carson  (1989)  consider  two  types  of  construct  validity 
--  convergent  and  theoretical  validity.  The  first  deals  with  whether  CV  values 
"converge"  with  other  measures  of  the  same  true  value.  For  example,  one  might 
compare  a  CV  estimate  of  the  value  of  access  to  an  outdoor  recreational  site 
with a value estimate based on a travel-cost model of recreational demand for
the  site. 

Theoretical  validity  tests  involve  hypotheses  about  relationships  between 
CV  estimates  of  value  and  other  variables  that  theory  indicates  might  be  related 
to  the  true  value  in  some  way.  For  example,  one  test  could  involve  the 
relationship  between  CV  values  and  income.  Theoretical  validity  tests  come  in 
two  principal  forms.  Traditionally,  CV  studies  have  often  estimated  valuation 
equations  where  expressions  of  willingness  to  pay  are  regressed  on  income,  other 
socioeconomic  variables,  attitude  measures,  and  past  behavior  that  might  enhance 
or  detract  from  the  value  of  the  intervention  (e.g.,  participation  in  outdoor 
recreation involving the amenity being valued). The NOAA Panel proposed a
variant  of  this  approach  when  it  recommended,  in  the  context  of  damage 
assessment,  that  responses  to  the  primary  valuation  question  be  cross-tabulated 
with  income,  prior  knowledge  of  the  site,  attitudes  toward  the  environment, 
distance  to  the  site  and  other  variables. 
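
To make the idea of a valuation equation concrete, the sketch below fits willingness to pay on income alone by ordinary least squares. It is a deliberately stripped-down illustration: an actual valuation equation would include the other covariates listed above, and the data, units, and function name here are hypothetical.

```python
def ols_slope_intercept(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x has no variation")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


income = [20.0, 30.0, 40.0, 50.0, 60.0]   # household income, $ thousands (hypothetical)
wtp = [8.0, 11.0, 15.0, 18.0, 23.0]       # stated willingness to pay, $ (hypothetical)

slope, intercept = ols_slope_intercept(income, wtp)
print(f"WTP = {intercept:.2f} + {slope:.2f} * income")
```

If the amenity is a normal good, a positive income coefficient would be in line with theoretical expectations; a full analysis would also examine its statistical significance.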

The  other  approach  to  theoretical  validity  testing  involves  the  valuation 
of  two  or  more  variants  on  the  same  set  of  amenities  and  the  testing  of 
hypotheses  about  the  relationships  between  these  estimated  values  based  on 
neoclassical microeconomic theory (Diamond et al. 1992; McFadden 1994). For
example,  consider  two  levels  of  injury  from  an  oil  spill  and  assume  that  most 
potential  study  subjects  consider  the  second  level  of  injury  to  be  more  severe 
than  the  first.  Then,  barring  satiation,  study  subjects  should  place  a  higher 
monetary  value  on  the  second.   Testing  this  hypothesis  with  contingent  values 
for the two levels of injury is the so-called scope test, as identified by the
NOAA  Panel.   Other  such  tests  can  be  devised.   One  might,  for  example,  try  to 
test  CV  values  for  transitivity  and  varying  degrees  of  availability  of 
substitutes  and  complements. 
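
A scope test of the kind just described might be sketched as follows, assuming two independent subsamples that each valued one level of injury. The data and function name are hypothetical, and the p-value relies on a large-sample normal approximation that we adopt here for simplicity; with samples this small, a t distribution would be the more careful choice.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def scope_test(wtp_small, wtp_large):
    """One-sided test of H0: mean WTP is no higher for the larger injury.

    Returns the test statistic and an approximate p-value based on the
    standard normal distribution (a large-sample assumption).
    """
    se = sqrt(variance(wtp_small) / len(wtp_small)
              + variance(wtp_large) / len(wtp_large))
    z = (mean(wtp_large) - mean(wtp_small)) / se
    p_value = 1.0 - NormalDist().cdf(z)
    return z, p_value


# Hypothetical stated values from two split samples.
first_level = [5, 10, 0, 15, 20, 10, 5, 15]      # less severe injury
second_level = [20, 25, 10, 30, 35, 15, 25, 40]  # more severe injury

z, p_value = scope_test(first_level, second_level)
print(f"z = {z:.2f}, one-sided p = {p_value:.4f}")
```

Rejecting the null hypothesis would mean that values respond to scope in the direction theory predicts; failing to reject it would count against the study's theoretical validity.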

To  the  extent  that  contingent  values  are  significantly  related  either  to 
other  contingent  values  or  to  other  variables  as  predicted  by  theory,  this 
supports  the  validity  of  the  study  since  it  implies  that  respondents' 
expressions  of  value  are  emanating  at  least  to  some  degree  from  the  same 
processes  that  theory  attempts  to  model.  This  buttresses  the  interpretation  of 
results  as  estimates  of  true  values. 

Failure  to  pass  a  construct  validity  test  may  indicate  flaws  in  the  design 
and execution of the study being scrutinized, including flaws that did not surface
during  content  validity  assessment.  In  this  way,  content  validity  assessment 
and  construct  validity  tests  may  be  mutually  reinforcing.  On  the  other  hand, 
a  study  which  appears  to  have  relatively  low  content  validity  may  redeem  itself, 
at  least  to  some  extent,  through  a  strong  showing  in  construct  validity  tests. 
Low  content  validity  means  only  that  a  number  of  doubts  exist  about  a  study's 
procedures.  If  resulting  values  nevertheless  seem  to  be  related  to  each  other 
and  to  other  parameter  estimates  in  theoretically  predictable  ways,  this  would 
indicate  that  the  potential  flaws  identified  during  content  validity  assessment 
may  not  have  been  as  problematical  as  was  feared.  Of  course,  the  strongest 
studies  are  still  the  ones  that  test  out  well  from  the  perspectives  of  both 
content  and  construct  validity. 

To  conduct  a  test  of  criterion  validity,  one  needs  to  "have  in  hand  a 
criterion  which  is  unequivocally  closer  to  the  theoretical  construct  than  the 
measure whose validity is being assessed" (Mitchell and Carson 1989, 192).
Applied  to  CV,  the  criterion  would  be  some  measure  of  value  that  is  arguably 
closer  to  the  true  value.  For  example,  given  the  high  level  of  credibility  of 
market  transactions  as  value  indicators,  one  might  compare  market  values  and 
contingent  values.  As  an  approximation  to  this  ideal,  values  from  actual  cash 
transactions  completed  in  "simulated  markets"  as  part  of  field  and  laboratory 
experiments have been compared to contingent values (Bohm 1972; Bishop and
Heberlein 1979; Dickie et al. 1987; Coursey, Hovis, and Schulze 1987; Kealy,
Montgomery, and Dovidio 1990; Bishop, Welsh, and Heberlein 1993; Neill et al.
1994; Boyce et al. 1989; Duffield and Patterson 1992; Seip and Strand 1992;
Champ et al. 1994).13

Criterion  validity  testing  occurs  mainly  in  methodologically  oriented 
studies.  If  it  were  possible  to  obtain  simulated  market  values  under  conditions 
where  most  applied  CV  studies  are  conducted,  there  would  be  no  need  for  CV.  One 
could  simply  use  the  simulated  market  values  in  policy  analysis  and  damage 
assessments. Instead, nearly all applied studies must rely on CV; simulated
market  studies  are  only  conducted  under  special,  unusual  circumstances.  This 
means that the link between simulated market studies and the validity of values
estimated in any particular applied study is indirect. That is, results where
CV  performed  well  or  badly  in  simulated  market  experiments  are  used  to  infer 
whether  values  derived  as  part  of  an  entirely  separate  CV  study  are  or  are  not 
accurate.  Thus,  we  have  moved  from  validity  assessment  at  the  level  of  the 
individual  study  toward  assessment  of  the  CV  method  more  generally,  a  topic  that 
will  occupy  us  in  the  next  section.  First,  a  couple  of  simple  but  important 
observations  about  relationships,  at  the  level  of  the  individual  study,  between 
criterion  validity  and  content  validity  will  be  made. 

Notice,  first  of  all,  the  importance  of  content  validity  in  criterion 
validity  research:  high  content  validity  of  simulated  market-CV  comparison 
studies is fundamental to obtaining results that are potentially generalizable.
If  either  simulated  market  or  CV  treatments  involve  procedures  that  are  of 
doubtful  content  validity,  then  within-study  conclusions  about  the  validity  of 
the  experiment's  contingent  values  are  suspect.  Either  the  criterion  or  the 
contingent  values  or  both  are  flawed. 


13 One might also use CV referendum questions to predict outcomes of actual
referenda, as was done by Carson, Hanemann, and Mitchell (1986). For a discussion
of the methodological issues, see Bishop et al. (1994).

Second, simulated market experiments with high content validity should be
fertile  ground  for  improving  the  criteria  applied  in  content  validity 
assessments.  If  certain  procedures  seem  to  improve  CV's  performance  in 
comparison  to  simulated  markets,  then  those  same  procedures  might  be  expected 
to  improve  accuracy  in  applied  CV  studies.  Similar  conclusions  might  follow 
from  value  comparisons  where  CV  performed  poorly. 

III.  CONTENT VALIDITY AND THE OVERALL VALIDITY OF CV

Content validity as that term has been used here is strictly a
characteristic  of  individual  studies.  It  would  be  nonsensical  to  argue  that  the 
CV  method  as  a  whole  has  or  does  not  have  content  validity.  On  the  other  hand, 
construct  validity  testing  and  criterion  validity  studies  may  form  the  basis  for 
generalizations  about  the  method  as  a  whole.  Such  generalizations  will  be 
tenable,  however,  only  to  the  extent  that  they  are  based  on  individual  studies 
with  high  content  validity.  This  is  a  point  that  is  often  overlooked  in  the 
current  literature.  Researchers  are  trying  to  reach  conclusions  about  the 
validity  of  the  method  without  establishing  the  validity  of  the  studies  they 
intend  to  use  as  a  basis  for  such  conclusions.  A  couple  of  examples  will 
illustrate.

Kahneman  and  Knetsch  (1992,  58)  concluded  that  embedding  effects  are 
"perhaps  the  most  serious  shortcoming  of  CVM."  Several  studies  were  presented 
to support this conclusion, yet very serious doubts can be raised about the
content  validity  of  these  studies.  The  study  involving  embedding  of  emergency 
medical  personnel  and  equipment,  disaster  preparedness,  and  the  environment  in 
general  will  serve  as  an  especially  good  illustration  because  it  was  the 
centerpiece  of  the  Kahneman  and  Knetsch  article.  At  all  three  levels  of 
embedding,  the  CV  exercise  involved  vaguely  defined  products.  The  attributes 
of  the  environment,  disaster  preparedness,  and  emergency  personnel  and  equipment 
that  were  potentially  relevant  to  subjects  had  not  been  investigated  as  part  of 
the  survey  preparation.  No  attempt  was  made  even  to  describe  a  set  of 
potentially relevant attributes arrived at a priori. Instead, respondents were
left to decide for themselves what the interviewer meant by the environmental
improvements,  increases  in  disaster  preparedness,  and  increases  in  emergency 
medical personnel and equipment. In none of their three treatments were the
baseline  or  the  proposed  levels  of  attribute  provision  clearly  defined.  These 
are  clear  violations  of  the  principles  underlying  Question  4  of  the  Rating  Form. 
Willingness  to  pay  was  to  be  expressed  in  terms  of  vaguely  defined  taxes, 
prices,  and  user  fees  with  the  money  to  be  placed  in  some  sort  of  "special 
fund."  This  lack  of  specificity  not  only  raises  doubts  whether  study  subjects 
were  well  enough  informed  about  the  context  of  valuation,  but  also  leads  to 
theoretical  questions  about  the  incentive  properties  of  the  payment  vehicle. 
This  seriously  violates  principles  underlying  Question  5.  In  our  judgement, 
these  flaws  are  sufficiently  serious  to  be  considered  fatal.  Procedures  in  the 
other  studies  Kahneman  and  Knetsch  draw  upon  are  less  well  described  in  the 
published  article.  To  the  extent  that  all  of  their  studies  have  similar  flaws, 
their  general  conclusions  about  inherent  flaws  in  CV  rest  on  a  very  shaky 
empirical  foundation. 

Or, consider the content validity of the data set on the value of avoiding
logging of wilderness areas in the Western U.S. that serves as the empirical
basis for papers by Hausman et al. (1992), Hausman et al. (1993), and McFadden
(1994). Diamond et al. (1992, 15) felt they had adequate evidence to conclude
that, in general, "whatever contingent valuation surveys are measuring, they are
not measuring consumers' preferences for environmental amenities." McFadden
(1994), to his credit, was more guarded, but still believed that the data were
strong enough to warrant conclusions about the method as a whole. He pointed
out (McFadden 1994, 689), "The results call into question the reliability of the
CV method for estimating existence values." We would question whether the
survey used to gather the data has sufficient content validity to support such
far-reaching conclusions.

A total of eight different treatments, summarized in Table 2 of McFadden
(1994, 696), were involved. Depending on the treatment, participants were told
that one, seven, eight, nine, or 57 wilderness areas would be logged somewhere
in Colorado, Wyoming, Montana, or Idaho. Subjects were told that, "One proposal
for commercial development involves allowing timber companies to harvest the
mature timber at a rate of 1% per year, indefinitely. This would necessitate
building roads and bringing in mechanical equipment" (McFadden 1994, 696).
Nothing else was said about the details of logging. The eight interventions to
be  valued  involved  prevention  of  logging  in  one  or  more  areas.  The  area  or 
areas  that  might  be  logged  were  specified  by  name,14  but  little  additional 
information  was  given  about  them  beyond  their  sizes  and  general  location.  For 
example,  nothing  was  said  about  whether  clear-cutting  or  selective  cutting  would 
be  practiced.  What  steps  would  or  would  not  be  taken  to  protect  water  quality 
and  fish  and  wildlife  habitats  were  not  described.  One  of  the  treatments,  which 
played an important role in the analysis of Diamond et al., involved variation
in  the  number  of  other  wilderness  areas  that  might  also  be  logged,  but  nothing 
else  was  said  about  these  other  areas  except  that  at  least  one  of  them  would  be 
in  the  subject's  home  state.  The  purpose  of  the  various  interventions  was 
explicitly  to  reduce  the  deficit.  Opening  up  such  a  high-profile  issue  in  the 
scenario  likely  would  have  piqued  the  interest  of  many  respondents,  yet  nothing 
more  was  said  about  it.  For  example,  nothing  was  said  about  how  much  each 
intervention  would  reduce  the  deficit.  All  this  vagueness  in  the  scenario  would 
count heavily against these studies under Rating Form Questions 4 and 5.

Other concerns should be at least mentioned. The opening up of designated
wilderness areas to logging would be big news in the West,15 and payment of an
income tax surcharge to save one or a few wilderness areas would be
unprecedented in the experience of respondents. These aspects raise doubts
about respondent acceptance of and belief in the scenario (Rating Form Questions
9 and 10). Many CV researchers would question whether an issue of this
complexity could be effectively addressed at all in the context of a telephone
interview (Rating Form Question 12). Nothing is said in the papers using
these data sets about qualitative work to identify attributes of
wilderness areas relevant to respondents or potential flaws in the instrument
(Questions 4 and 13). Nor are results of pretests and pilots (if any) of the
instrument reported (Question 14).


14 In the case where 57 wilderness areas were valued, there was no need to specify them by name since
this constituted all of the designated wilderness areas in the four states.

15 McFadden (1994, 696) points out that "This resource issue was chosen because at the time of the study
in 1990 there was active discussion in Congress and the media in the western U.S. regarding logging on
government lands." However, these proposals did not involve designated wilderness areas. Increased
logging on government holdings outside the wilderness system was controversial. A real proposal to open
up even one designated wilderness area would have been one of the major environmental confrontations
of the century.

Like Kahneman and Knetsch, Diamond et al. attempted to draw general
conclusions  about  the  deficiencies  of  the  CV  method  based  on  research  of 
questionable  content  validity.  McFadden  (1994)  certainly  raises  some  questions 
about  the  validity  of  the  particular  CV  application  he  chose  to  analyze,  but 
whether  these  shortcomings  would  appear  in  higher  quality  data  sets  remains  to 
be  seen.  Researchers  can  and  should  work  toward  generalizations  about  the 
validity of CV. However, the studies used as foundations for such
generalizations  need  to  possess  a  high  degree  of  content  validity. 

IV.  SUMMARY

In  this  paper,  we  have  attempted  to  clarify  and  systematize  an  approach 
to  content  validity  assessment.  A  content  valid  CV  study  is  rooted  throughout 
in  a  clear  theoretical  definition  of  the  true  value  of  the  intervention.  At  the 
heart of such a study will be its scenario. Based on well-documented evidence
of  the  respondent-relevant  effects  of  the  intervention,  a  sound  scenario 
effectively  communicates  the  potential  effects  of  the  intervention  to 
respondents.  It  includes  whatever  information  they  need  regarding  substitutes 
for  the  environmental  resources  in  question  and  may  need  to  remind  respondents 
of  their  budget  constraints.  It  also  includes  a  fully  specified  and  incentive 
compatible  context  for  valuation.  It  does  all  this  in  ways  that  potential 
respondents  will  accept  and,  if  possible,  believe. 

Looking  beyond  the  scenario,  a  content  valid  survey  instrument  will 
include  well-designed  questions  to  support  construct  validity  testing.  The  mode 
chosen  for  administering  the  survey  will  be  appropriate  to  the  complexity  of  the 
scenario and the ultimate goals of the study. Prior to administration, the
instrument  will  have  been  subjected  to  thorough  qualitative  investigation, 
pretesting,  and,  if  needed,  piloting  to  work  out  as  many  bugs  as  possible. 
Econometric  analysis  of  the  results  will  have  been  adequately  performed  and 
final  results  effectively  reported. 

To  the  extent  that  studies  fall  short  of  these  ideals,  they  may  still  have 
substantial merits. Content validity is normally a matter of degree. However,
some  studies  will  fall  below  minimal  standards  and  be  judged  content  invalid. 

Content  validity  is  one  leg  of  a  tripod  of  approaches  upon  which  the 
ultimate  validity  of  any  study  will  be  judged.  Construct  validity  assessment 
and  criterion  validity  tests  form  the  other  two  legs.  However,  the  three 
approaches  to  validity  are  not  separate  but  mutually  reinforcing.  Construct 
validity  assessment  involves  theory-based  hypothesis  testing.  Data  for  these 
tests  comes  partly  from  other  questions  in  the  survey.  Such  questions  must 
themselves  have  a  high  degree  of  content  validity.  Criterion  validity  tests 
allow  insights  into  the  overall  validity  of  the  CV  method  and  may  help  refine 
content  validity  criteria,  provided  they  are  well  designed  and  executed. 

We  do  not  expect  any  one  study  or  a  small  number  of  studies  to  be 
conclusive  regarding  the  overall  validity  of  CV.  Too  many  studies  have  obtained 
favorable  results  in  construct  validity  testing  and  criterion  validity 
experiments  to  make  viable  a  flat  negative  verdict  regarding  the  method.  On  the 
other  hand,  few  would  be  willing  at  this  stage  to  give  CV  an  unqualified 
endorsement.  Too  many  doubts  and  anomalous  empirical  results  remain.  As  the 
research  continues  to  accumulate,  individual  studies  need  to  strive  for  high 
levels  of  content  validity  if  sound,  empirically-based  conclusions  are  to  be 
forthcoming. 

CV  has  attracted  so  much  attention  because  of  its  potential  for  including 
a  much  wider  set  of  phenomena  under  the  umbrella  of  applied  welfare  economics. 
Hard-to-measure  use  values  and  even  non-use  values  may  be  includable.  It  has 
been  so  controversial  partly  because  counting  such  values  in  benefit-cost 
analyses  and  damage  assessments  threatens  entrenched  interests,  but  the  roots  of 
the  controversy  are  more  fundamental  than  that.  To  admit  evidence  on  values 
from  surveys,  where  revealed  preference  data  have  historically  dominated,  is  a 
big  step  for  economists.  Whether  survey  evidence  on  economic  values  is 
"admissible"  to  applied  welfare  analysis  or  not  should  be  approached  in  a 
cautious,  but  open-minded  way  based  on  carefully  thought  out  "rules  of 
evidence."  Thus  do  the  social  sciences  progress.  Drawing  on  its  sister 
disciplines,  economics  can  evaluate  this  new  direction  based  on  content, 
construct,  and  criterion  validity.  Content  validity  is  central  to  the  process 
and  deserves  more  attention  if  real  progress  is  to  be  made. 
Figure  1 
CONTENT  VALIDITY  RATING  FORM  FOR  CONTINGENT  VALUATION  STUDIES 

(1)  Do  study  procedures  contain  flaws  that  are  so  serious  that  they  would  rule 
out  use  of  the  results  to  achieve  study  goals?  (If  yes,  assign  zero  total 
points  below,  record  reasons  under  written  comments,  and  terminate  the 
review.) 

(2)  Was  the  true  value  clearly  defined?   (5  points) 


(3)  Were  the  environmental  attributes  relevant  to  potential 
subjects  fully  identified?   (10  points)   

(4)  Were  the  potential  effects  of  the  intervention  on 
environmental  attributes  and  other  economic  parameters 
adequately  documented  and  communicated?   (10  points) 

(5)  Were  respondents  aware  of  the  existence  and  status  of 
environmental  substitutes?   (5  points)    

(6)  Was  the  budget  constraint  adequately  stressed?   (5  points) 

(7)  Was  the  context  for  valuation  fully  specified  and 
incentive  compatible?   (5  points)   

(8)  Does  the  CV  question  elicit  willingness  to  pay?  (5  points) 

(9)  Did  survey  participants  accept  the  scenario?   (5  points) 

(10)  Did  survey  respondents  believe  the  scenario?   (5  points) 

(11)  How  adequate  and  complete  were  survey  questions  other 
than  those  designed  to  elicit  values?   (5  points) 

(12)  Was  the  survey  mode  appropriate?   (10  points)   

(13)  Were  qualitative  research  procedures,  pretests,  and 
pilots  sufficient  to  find  and  remedy  identifiable  flaws 
in  the  instrument  and  associated  materials?   (5  points) 

(14)  Given  study  objectives,  how  adequate  were  procedures 
employed  to  choose  study  subjects,  assign  them  to 
treatments  (if  applicable),  and  encourage  high 
response  rates?   (5  points)   

(15)  Was  the  econometric  analysis  adequate?  (5  points) 

(16)  How  adequate  are  the  written  materials  from  the 
study?   (5  points) 

(17)  Are  there  other  concerns  relating  to  the  design  and 
execution  of  the  study  that  have  not  already  been 
addressed?   (10  points;  if  less  than  10  points  are 
assigned,  explain  other  concerns  under  "Written 
Comments"  below.) 


TOTAL  POINTS:  (If  the  study  has  fatal  flaws,  enter  zero; 
otherwise  enter  the  total  of  points  assigned  in  Questions  2 
through  17.) 


Written  Comments: 
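
The scoring rule in Figure 1 can be illustrated with a short sketch. The point maxima below are transcribed from Questions 2 through 17 of the form and sum to 100. The names MAX_POINTS and total_points are hypothetical and are not part of the form, and the range check (each score between zero and the question's listed points) is an assumption about how the form is meant to be filled in, since the form itself states only the points available for each question.

```python
# Maximum points for Questions 2 through 17, transcribed from Figure 1.
MAX_POINTS = {
    2: 5, 3: 10, 4: 10, 5: 5, 6: 5, 7: 5, 8: 5, 9: 5, 10: 5,
    11: 5, 12: 10, 13: 5, 14: 5, 15: 5, 16: 5, 17: 10,
}


def total_points(scores, fatal_flaw=False):
    """Return the TOTAL POINTS entry for one completed rating form.

    scores maps question number (2..17) to the points a reviewer assigned.
    fatal_flaw is the reviewer's answer to Question 1; if True, the form
    directs the reviewer to enter zero and terminate the review, so any
    scores supplied are ignored.
    """
    if fatal_flaw:
        return 0
    if set(scores) != set(MAX_POINTS):
        raise ValueError("scores must cover exactly Questions 2 through 17")
    for question, points in scores.items():
        # Assumed rule: a score may not be negative or exceed the listed points.
        if not 0 <= points <= MAX_POINTS[question]:
            raise ValueError(f"Question {question}: score {points} out of range")
    return sum(scores.values())


# A study that receives full marks on every question scores the maximum.
perfect = total_points(dict(MAX_POINTS))

# A study judged fatally flawed under Question 1 scores zero.
flawed = total_points({}, fatal_flaw=True)
```

Because the maxima sum to 100, a total can be read directly as a percentage of the attainable score, although the form does not say how intermediate totals should be interpreted.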


